[Be advised that this is the newest page on my site and is still being written!]

Importing Data

Because R is open source, community-contributed packages can read virtually any data format you are likely to encounter.

Small Data with vroom

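Most R users already know readr, readxl, haven and data.table, so I’d like to suggest reading up on the impressive vroom package instead. It reads delimited files quickly by indexing them first and loading the values lazily, only when they are actually needed.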
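
A call along these lines would produce output like the listing that follows. This is only a minimal sketch: it assumes the file is the `mtcars.csv` example that ships with vroom (located with `vroom_example()`), and the object name `cars` is my own choice for illustration.

```r
library(vroom)

# Assumption: use the example CSV bundled with the vroom package
path <- vroom_example("mtcars.csv")

# vroom() guesses the delimiter and the column types from the file
cars <- vroom(path)

# Printing the tibble shows the first 10 rows
cars
```

Passing a specification to `col_types` (or setting `show_col_types = FALSE`) silences the column-type message.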

## Rows: 32
## Columns: 12
## Delimiter: ","
## chr [ 1]: model
## dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message
## # A tibble: 32 x 12
##    model         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2 Mazda RX4 ~  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4 Hornet 4 D~  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5 Hornet Spo~  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ... with 22 more rows

Medium Data with etl and dbplyr

The etl package allows R users to populate a local SQL database and analyze data using the familiar dplyr verbs.

I am vastly more skilled in wrangling data within the tidyverse framework than in SQL. Luckily, R can connect to almost any type of SQL database via the amazing DBI package, and the dbplyr package can convert dplyr code to SQL queries.
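
To make the dplyr-to-SQL idea concrete, here is a minimal sketch. It assumes the RSQLite package is installed and uses a throwaway in-memory SQLite database filled with the built-in `mtcars` data; the names `con`, `cars_db`, `query` and `result` are purely illustrative.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Assumption: an in-memory SQLite database stands in for a real SQL server
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# tbl() creates a lazy reference; no rows are pulled into R yet
cars_db <- tbl(con, "mtcars")

# Ordinary dplyr verbs, which dbplyr translates to SQL behind the scenes
query <- cars_db %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl)

# Inspect the generated SQL, then run it and bring the result into R
show_query(query)
result <- collect(query)

dbDisconnect(con)
```

Nothing is executed on the database until `collect()` is called, so the heavy lifting stays on the SQL side and only the small summary table comes back into R.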

Big Data with sparklyr and h2o

As one might imagine, social science academics rarely, if ever, encounter truly big data (I know I never did when completing my MA in political science or PhD in public administration). Cluster computing is simply unnecessary in that world. Consequently, I am only familiar with packages that are well-known amongst data scientists, such as sparklyr (R’s interface for Apache Spark) and h2o (R’s interface for H2O.ai’s platform).

Modeling with tidymodels

Coming soon!

Regression and Classification Random Forests with ranger

Coming soon!

Gradient Boosting with xgboost

Coming soon!