Working with large datasets in R

2 Recommended resources

2.1 Resources for handling big data in R

  • Handling large data sets in R

Notes:

Medium-sized datasets (< 2 GB, typically in the 1-2 GB range): can be loaded in R within the memory limit, but processing is cumbersome.

Large files that cannot be loaded in R because of R or OS limitations:

  • Large files (2-10 GB): process locally using workaround solutions.

  • Very large files (> 10 GB): need distributed, large-scale computing.
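As a rough illustration of the size tiers above, the following sketch checks a file's size before deciding how to handle it. The helper `classify_file_size()` and its labels are hypothetical, not part of any package, and the treatment of files of exactly 2 GB or 10 GB is an assumption because the notes leave those boundaries open.

```r
# Hypothetical helper (not from the source): map a file size in bytes
# to the handling tiers described in the notes above.
classify_file_size <- function(size_bytes) {
  size_gb <- size_bytes / 1024^3
  ifelse(size_gb < 2, "medium: load in memory",
         ifelse(size_gb <= 10, "large: process locally with workarounds",
                "very large: distributed computing"))
}

# Demonstrate on a small temporary CSV; file.size() returns the size in bytes.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:10), tmp, row.names = FALSE)
tier_tmp <- classify_file_size(file.size(tmp))
print(tier_tmp)

# Illustrative sizes of 1 GB, 5 GB, and 50 GB.
tiers <- classify_file_size(c(1, 5, 50) * 1024^3)
print(tiers)
```

Checking `file.size()` first is cheap and avoids starting a read that would exhaust memory part-way through.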

  • Five ways to handle Big Data in R

Notes:

Rule of thumb:

Data sets that contain up to one million records can easily be processed with standard R.

Data sets with about one million to one billion records can also be processed in R, but need some additional effort.

Data sets that contain more than one billion records need to be analyzed with MapReduce algorithms.
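To give a sense of scale for the lower end of this rule of thumb, the sketch below builds a one-million-row data frame and reports its in-memory size and the time taken by a simple grouped mean in base R. The column names and contents are made up for illustration, and the numbers will differ by machine.

```r
# Illustrative data only: one million records with an id, a group, and a value.
set.seed(42)
n <- 1e6
df <- data.frame(
  id    = seq_len(n),
  group = sample(letters, n, replace = TRUE),
  value = runif(n)
)

# In-memory size of the data frame, in megabytes.
size_mb <- as.numeric(object.size(df)) / 1024^2

# Time a simple grouped aggregate using base R only.
elapsed <- system.time(
  means <- tapply(df$value, df$group, mean)
)[["elapsed"]]

cat(sprintf("rows: %d, size: %.1f MB, grouped mean: %.2f s\n",
            nrow(df), size_mb, elapsed))
```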

  • Efficient data carpentry

  • Getting Started With Parallel Programming In R

  • Efficiency in Joining Two Data Frames

  • For large tables in R dplyr’s function inner_join() is much faster than merge()

  • BASE R, THE TIDYVERSE, AND DATA.TABLE: A COMPARISON OF R DIALECTS TO WRANGLE YOUR DATA

  • Speed comparison of rbind, bind_rows, and rbindlist

  • dplyr backends: multidplyr 0.1.0, dtplyr 1.1.0, dbplyr 2.1.0

Notes on the speed comparison: rbindlist() is the fastest method and rbind() is the slowest; bind_rows() is half as fast as rbindlist().
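The ranking can be checked locally with a small timing sketch like the one below. It assumes the data.table and dplyr packages are installed; the number and size of the chunks are arbitrary choices, and the measured ratios will vary with the data and the machine.

```r
library(data.table)  # provides rbindlist()
library(dplyr)       # provides bind_rows()

# Illustrative input: 200 small data frames with identical columns.
set.seed(1)
chunks <- lapply(1:200, function(i) {
  data.frame(id = seq_len(1000), value = rnorm(1000), chunk = i)
})

# Stack the chunks three ways and time each approach.
t_base  <- system.time(res_base  <- do.call(rbind, chunks))
t_dplyr <- system.time(res_dplyr <- bind_rows(chunks))
t_dt    <- system.time(res_dt    <- rbindlist(chunks))

# Elapsed seconds for each method.
timings <- c(rbind     = t_base[["elapsed"]],
             bind_rows = t_dplyr[["elapsed"]],
             rbindlist = t_dt[["elapsed"]])
print(timings)
```

All three calls produce the same 200,000 rows; only the time taken and the class of the result differ (`rbindlist()` returns a data.table).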

2.2 Resources for the data.table package

  • Data Transformation with data.table : : CHEAT SHEET

  • A data.table and dplyr tour

  • Data.Table – everything you need to know to get you started in R

  • Blazing Fast Data Wrangling With R data.table

  • data.table in R – The Complete Beginners Guide

  • Advanced tips and tricks with data.table

  • Advanced-Data Wrangling In R — 4

  • R : DATA.TABLE TUTORIAL (WITH 50 EXAMPLES)

2.3 Resources for measuring R performance

  • 5 ways to measure running time of R code

  • Measuring performance

  • Efficient optimisation

  • Profvis — Interactive Visualizations for Profiling R Code

  • Strategies to Speedup R Code

  • R Code Optimizer

  • R Performance Tuning | Learn Tips to Improve Speed & Memory of R Programs

2.4 Resources for PostGIS database

  • About PostGIS

  • Install Postgres/PostGIS and get started with spatial SQL

  • Enabling PostGIS

PostGIS is an optional extension: it must be enabled separately in every database where you want to use it.
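A minimal sketch of enabling the extension from R is shown below. It assumes the DBI package is installed and that an open connection to a PostgreSQL database is available, for example one created with RPostgres. The helper name `enable_postgis()` and the connection details in the commented usage are hypothetical.

```r
library(DBI)

# Hypothetical helper: enable PostGIS in the database behind `con`
# and return the installed PostGIS version as a one-row data frame.
enable_postgis <- function(con) {
  # The extension is enabled per database, so run this once in each one.
  dbExecute(con, "CREATE EXTENSION IF NOT EXISTS postgis;")
  dbGetQuery(con, "SELECT PostGIS_Version();")
}

# Usage (placeholder connection details, replace with your own):
# con <- dbConnect(RPostgres::Postgres(), dbname = "mydb",
#                  host = "localhost", user = "postgres",
#                  password = "your_password")
# enable_postgis(con)
# dbDisconnect(con)
```

Creating the extension requires sufficient privileges in the target database.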

2.5 Resources for Alteryx software

  • What is Alteryx?

  • Data Cleansing in Alteryx for Beginners

  • Integrating R in Alteryx

  • Alteryx - Bring Your Own R Code

  • How do I Import & Union Multiple Excel Files with Alteryx

  • The Union tool in Alteryx

  • The unique tool in Alteryx

  • How to Connect Alteryx to PostgreSQL

2.6 Datasets

  • Research data

  • Energy Performance of Buildings Data: England and Wales

  • Land Registry Price Paid Data (PPD)

  • A new attribute-linked residential property price dataset for England and Wales 2011-2019

  • Scottish Energy Performance Certificate Register

  • Scottish Domestic Energy Performance Certificates

  • Scottish Non-domestic Energy Performance Certificates

2.7 Extensions

  • Big Data: Wrangling 4.6M Rows with dtplyr

  • Big Data Analytics with R

  • R and Hadoop Data Analytics - RHadoop

  • R and Hadoop: Step-by-step tutorials

  • Mastering Spark with R

  • Hadoop Vs Spark - Detailed Comparison

  • GOOGLE