Importing Data

Because R is open source and backed by a large package ecosystem, it can import data in virtually any format.

Small Data with `vroom`

Since readr, readxl, haven and data.table are already widely known, I’d like to suggest reading up on the impressive vroom package, which reads delimited files quickly by indexing them first and loading values lazily.

library(magrittr) # Give me %>% or give me death.
library(vroom)
vroom_example("mtcars.csv") %>% vroom()

## Rows: 32
## Columns: 12
## Delimiter: ","
## chr [ 1]: model
## dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
## 
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message

## # A tibble: 32 x 12
##    model         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2 Mazda RX4 ~  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4 Hornet 4 D~  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5 Hornet Spo~  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ... with 22 more rows
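
The message above suggests passing a specification to `col_types`. A minimal sketch of what that could look like for the same example file follows; the choice to name only `model` and let `.default` cover the numeric columns is mine, not something the package requires.

```r
library(vroom)

# Declare the delimiter and column types up front so vroom does not guess.
# Assumption: `model` is the only text column; everything else is numeric,
# which matches the guessed specification printed above.
cars <- vroom(
  vroom_example("mtcars.csv"),
  delim = ",",
  col_types = cols(model = col_character(), .default = col_double())
)

spec(cars)  # inspect the column specification that was applied
```

With an explicit specification, the import no longer prints the guessed column types, and a file that stops matching the declared types is flagged instead of silently re-guessed.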

Medium Data with `etl` and `dbplyr`

The etl package allows R users to populate a local SQL database and analyze data using the familiar dplyr verbs.

I am vastly more skilled in wrangling data within the tidyverse framework than in SQL. Luckily, R can connect to almost any type of SQL database via the amazing DBI package, and the dbplyr package can convert dplyr code to SQL queries.
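
To make the DBI and dbplyr workflow concrete, here is a rough sketch. It assumes the RSQLite package is installed and uses an in-memory SQLite database as a stand-in for a real database server; the table name and the summary are only for illustration.

```r
library(DBI)
library(dplyr)   # dplyr verbs; also provides %>%
library(dbplyr)  # translates dplyr pipelines into SQL

# Assumption: an in-memory SQLite database stands in for a real server.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# tbl() creates a lazy reference; no rows are pulled into R yet.
by_cyl <- tbl(con, "mtcars") %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE), n = n()) %>%
  arrange(cyl)

show_query(by_cyl)         # print the SQL that dbplyr generated
result <- collect(by_cyl)  # run the query and return a tibble

dbDisconnect(con)
```

The pipeline is only translated and executed when `collect()` is called, so the grouping and aggregation happen inside the database rather than in R's memory.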

Big Data with `sparklyr` and `h2o`

As one might imagine, social science academics rarely, if ever, encounter truly big data (I know I never did when completing my MA in political science or PhD in public administration). Cluster computing is simply unnecessary in that world. Consequently, I am only familiar with the packages that are well known among data scientists, such as sparklyr (R’s interface for Apache Spark) and h2o (R’s interface for H2O.ai’s platform).
