Linear Regression


In the model formula, use a period to include every variable in the data frame (other than the specified dependent variable) as a covariate.
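A minimal sketch of the call that produces the output below, using the built-in `attitude` data:

```r
# Regress rating on every other column of the attitude data frame
fit <- lm(rating ~ ., data = attitude)
summary(fit)
```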

## 
## Call:
## lm(formula = rating ~ ., data = attitude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9418  -4.3555   0.3158   5.5425  11.5990 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.78708   11.58926   0.931 0.361634    
## complaints   0.61319    0.16098   3.809 0.000903 ***
## privileges  -0.07305    0.13572  -0.538 0.595594    
## learning     0.32033    0.16852   1.901 0.069925 .  
## raises       0.08173    0.22148   0.369 0.715480    
## critical     0.03838    0.14700   0.261 0.796334    
## advance     -0.21706    0.17821  -1.218 0.235577    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.068 on 23 degrees of freedom
## Multiple R-squared:  0.7326, Adjusted R-squared:  0.6628 
## F-statistic:  10.5 on 6 and 23 DF,  p-value: 1.24e-05


Formal Checks of OLS Model Assumptions

The performance package, also part of the easystats project, provides several convenient functions for checking OLS model assumptions.
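The output below comes from checks along these lines (the exact print format varies by package version):

```r
library(performance)
check_collinearity(fit)        # variance inflation factors (multicollinearity)
check_normality(fit)           # are the residuals normally distributed?
check_heteroscedasticity(fit)  # is the error variance constant?
```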

## # A tibble: 6 x 3
##   Parameter    VIF SE_factor
##   <chr>      <dbl>     <dbl>
## 1 complaints  2.67      1.63
## 2 privileges  1.60      1.27
## 3 learning    2.27      1.51
## 4 raises      3.08      1.75
## 5 critical    1.23      1.11
## 6 advance     1.95      1.40
## OK: Residuals appear as normally distributed (p = 0.383).
## OK: Error variance appears to be homoscedastic (p = 0.153).


Visual Checks of Model Assumptions

The formal checks suggest that this model satisfies the OLS assumptions, but visual checks are still helpful, especially when learning OLS.
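One convenient option is `performance::check_model()`, which draws a panel of ggplot2-based diagnostics; a sketch (an assumption on my part, since the plotting code is not shown):

```r
# Panel of visual diagnostics: residuals, normality, influential points, etc.
performance::check_model(fit)
```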

(Diagnostic plots not reproduced here.)


Logistic Regression


Using sample data from the ISLR package:
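A sketch of the setup, assuming `bivariate` (the object name in the call below) is the `Default` dataset from ISLR, whose columns match the glimpse shown:

```r
library(ISLR)
bivariate <- Default  # 10,000 customer records: default, student, balance, income
dplyr::glimpse(bivariate)

# Logistic regression of default on all other variables
logit_fit <- glm(default ~ ., family = binomial(link = "logit"), data = bivariate)
summary(logit_fit)
```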

## Rows: 10,000
## Columns: 4
## $ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No,...
## $ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No...
## $ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885...
## $ income  <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491...
## 
## Call:
## glm(formula = default ~ ., family = binomial(link = "logit"), 
##     data = bivariate)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4691  -0.1418  -0.0557  -0.0203   3.7383  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
## studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
## balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
## income       3.033e-06  8.203e-06   0.370  0.71152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.5  on 9996  degrees of freedom
## AIC: 1579.5
## 
## Number of Fisher Scoring iterations: 8


The mfx package is useful if you want odds ratios or marginal effects:
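Both calls appear in the output below; a sketch:

```r
# Exponentiated coefficients (odds ratios)
mfx::logitor(default ~ ., data = bivariate)

# Marginal effects, evaluated at the means of the covariates
mfx::logitmfx(default ~ ., data = bivariate)
```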

## Call:
## mfx::logitor(formula = default ~ ., data = bivariate)
## 
## Odds Ratio:
##             OddsRatio  Std. Err.       z     P>|z|    
## studentYes 5.2373e-01 1.2373e-01 -2.7376  0.006188 ** 
## balance    1.0058e+00 2.3323e-04 24.7376 < 2.2e-16 ***
## income     1.0000e+00 8.2026e-06  0.3698  0.711520    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Call:
## mfx::logitmfx(formula = default ~ ., data = bivariate)
## 
## Marginal Effects:
##                  dF/dx   Std. Err.       z     P>|z|    
## studentYes -1.2055e-03  4.3151e-04 -2.7936  0.005212 ** 
## balance     1.1998e-05  1.8268e-06  6.5679 5.104e-11 ***
## income      6.3446e-09  1.7204e-08  0.3688  0.712287    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## dF/dx is for discrete change for the following variables:
## 
## [1] "studentYes"


Poisson/Negative Binomial Regressions


Using sample data from the AER package:
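A sketch matching the call in the output:

```r
data("RecreationDemand", package = "AER")

# Poisson regression of recreational boating trips on all other variables
pois_fit <- glm(trips ~ ., family = poisson, data = RecreationDemand)
summary(pois_fit)
```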

## 
## Call:
## glm(formula = trips ~ ., family = poisson, data = RecreationDemand)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -11.8465   -1.1411   -0.8896   -0.4780   18.6071  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.264993   0.093722   2.827  0.00469 ** 
## quality      0.471726   0.017091  27.602  < 2e-16 ***
## skiyes       0.418214   0.057190   7.313 2.62e-13 ***
## income      -0.111323   0.019588  -5.683 1.32e-08 ***
## userfeeyes   0.898165   0.078985  11.371  < 2e-16 ***
## costC       -0.003430   0.003118  -1.100  0.27131    
## costS       -0.042536   0.001670 -25.467  < 2e-16 ***
## costH        0.036134   0.002710  13.335  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4849.7  on 658  degrees of freedom
## Residual deviance: 2305.8  on 651  degrees of freedom
## AIC: 3074.9
## 
## Number of Fisher Scoring iterations: 7

When estimating Poisson models, check for overdispersion:
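The test below looks like the output of `performance::check_overdispersion()` (an assumption; other overdispersion tests print similarly):

```r
# Compare the residual variance to the mean; a ratio well above 1 signals
# overdispersion, which violates the Poisson assumption
performance::check_overdispersion(pois_fit)
```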

## # Overdispersion test
## 
##        dispersion ratio =    6.298
##   Pearson's Chi-Squared = 4100.093
##                 p-value =  < 0.001
## Overdispersion detected.

Given the overdispersion present, we should select a different distributional family; a negative binomial regression is more appropriate:
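A sketch matching the call in the output:

```r
# Negative binomial regression handles overdispersed count data
nb_fit <- MASS::glm.nb(trips ~ ., data = RecreationDemand)
summary(nb_fit)
```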

## 
## Call:
## MASS::glm.nb(formula = trips ~ ., data = RecreationDemand, init.theta = 0.7292568331, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9727  -0.6256  -0.4619  -0.2897   5.0494  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.121936   0.214303  -5.235 1.65e-07 ***
## quality      0.721999   0.040117  17.998  < 2e-16 ***
## skiyes       0.612139   0.150303   4.073 4.65e-05 ***
## income      -0.026059   0.042453  -0.614    0.539    
## userfeeyes   0.669168   0.353021   1.896    0.058 .  
## costC        0.048009   0.009185   5.227 1.72e-07 ***
## costS       -0.092691   0.006653 -13.931  < 2e-16 ***
## costH        0.038836   0.007751   5.011 5.42e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.7293) family taken to be 1)
## 
##     Null deviance: 1244.61  on 658  degrees of freedom
## Residual deviance:  425.42  on 651  degrees of freedom
## AIC: 1669.1
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.7293 
##           Std. Err.:  0.0747 
## 
##  2 x log-likelihood:  -1651.1150


Fixed-Effects Panel Regressions

An example of a panel regression with year and individual fixed effects:
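A sketch matching the call in the output, using the Grunfeld investment data bundled with plm:

```r
data("Grunfeld", package = "plm")

# "Within" estimator with both individual (firm) and time (year) fixed effects
panel_fit <- plm::plm(inv ~ value + capital, data = Grunfeld,
                      effect = "twoway", model = "within")
summary(panel_fit)
```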

## Twoways effects Within Model
## 
## Call:
## plm::plm(formula = inv ~ value + capital, data = Grunfeld, effect = "twoway", 
##     model = "within")
## 
## Balanced Panel: n = 10, T = 20, N = 200
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -162.6094  -19.4710   -1.2669   19.1277  211.8420 
## 
## Coefficients:
##         Estimate Std. Error t-value  Pr(>|t|)    
## value   0.117716   0.013751  8.5604 6.653e-15 ***
## capital 0.357916   0.022719 15.7540 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    1615600
## Residual Sum of Squares: 452150
## R-Squared:      0.72015
## Adj. R-Squared: 0.67047
## F-statistic: 217.442 on 2 and 169 DF, p-value: < 2.22e-16


Cox Proportional Hazards Model
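An example using the `lung` cancer data from the survival package; a sketch matching the call in the output:

```r
library(survival)

# Hazard of death as a function of age, sex, performance score, calories, and weight loss
cox_fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog + meal.cal + wt.loss,
                 data = lung)
summary(cox_fit)
```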

## Call:
## coxph(formula = Surv(time, status) ~ age + sex + ph.ecog + meal.cal + 
##     wt.loss, data = lung)
## 
##   n= 170, number of events= 123 
##    (58 observations deleted due to missingness)
## 
##                coef  exp(coef)   se(coef)      z Pr(>|z|)    
## age       8.693e-03  1.009e+00  1.124e-02  0.773 0.439303    
## sex      -5.400e-01  5.827e-01  1.988e-01 -2.716 0.006602 ** 
## ph.ecog   5.152e-01  1.674e+00  1.483e-01  3.474 0.000513 ***
## meal.cal -4.499e-05  1.000e+00  2.512e-04 -0.179 0.857865    
## wt.loss  -1.117e-02  9.889e-01  7.548e-03 -1.480 0.138841    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##          exp(coef) exp(-coef) lower .95 upper .95
## age         1.0087     0.9913    0.9868    1.0312
## sex         0.5827     1.7160    0.3947    0.8604
## ph.ecog     1.6739     0.5974    1.2517    2.2384
## meal.cal    1.0000     1.0000    0.9995    1.0004
## wt.loss     0.9889     1.0112    0.9744    1.0036
## 
## Concordance= 0.64  (se = 0.03 )
## Likelihood ratio test= 22.34  on 5 df,   p=5e-04
## Wald test            = 21.31  on 5 df,   p=7e-04
## Score (logrank) test = 21.86  on 5 df,   p=6e-04


Multiple Imputation

Refer to the Amelia documentation before using!

I suggest instructing Amelia to: (1) use multiple CPU cores, (2) apply transformations to certain variables (e.g., logs for skewed variables, square roots for counts, and a logistic transformation for proportions), and (3) create multiple imputed datasets. You can then use the mean or median value across all imputations.
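A minimal sketch of such a call; the data frame `df` and its column names are hypothetical, and the transformation arguments should be matched to your own variables:

```r
library(Amelia)

# `df` is a hypothetical data frame with a skewed variable, a count, and a proportion
imputed <- amelia(
  df,
  m        = 5,            # create five imputed datasets
  logs     = "income",     # log-transform a skewed variable (hypothetical column)
  sqrts    = "visits",     # square root for a count (hypothetical column)
  lgstc    = "share",      # logistic transform for a proportion (hypothetical column)
  parallel = "multicore",  # use multiple CPU cores
  ncpus    = 4
)
```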


Robust Standard Errors

For linear estimators, using the estimatr package is rather straightforward:
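A sketch matching the call in the output (HC2 is `lm_robust()`'s default standard error type):

```r
# OLS with heteroscedasticity-robust (HC2) standard errors
robust_fit <- estimatr::lm_robust(rating ~ ., data = attitude)
summary(robust_fit)
```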

## 
## Call:
## estimatr::lm_robust(formula = rating ~ ., data = attitude)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)   CI Lower CI Upper DF
## (Intercept) 10.78708    14.4600  0.7460 4.632e-01 -19.125630  40.6998 23
## complaints   0.61319     0.1246  4.9210 5.675e-05   0.355418   0.8710 23
## privileges  -0.07305     0.1253 -0.5830 5.656e-01  -0.332249   0.1861 23
## learning     0.32033     0.1577  2.0318 5.388e-02  -0.005807   0.6465 23
## raises       0.08173     0.1732  0.4718 6.415e-01  -0.276599   0.4401 23
## critical     0.03838     0.1781  0.2156 8.312e-01  -0.329964   0.4067 23
## advance     -0.21706     0.1650 -1.3151 2.014e-01  -0.558482   0.1244 23
## 
## Multiple R-squared:  0.7326 ,    Adjusted R-squared:  0.6628 
## F-statistic: 11.24 on 6 and 23 DF,  p-value: 7.187e-06

For other models, you will likely have to use the lmtest and sandwich packages.
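For example, a sketch of robust standard errors for the earlier logistic regression (assuming the `logit_fit` object from above):

```r
# Coefficient tests with a heteroscedasticity-consistent covariance matrix
lmtest::coeftest(logit_fit, vcov = sandwich::vcovHC(logit_fit, type = "HC1"))
```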


Ordinal Logit Regression

An example from the MASS package’s documentation.
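A sketch matching the call in the output, using the Copenhagen `housing` data from MASS:

```r
# Proportional odds logistic regression of housing satisfaction
ordinal_fit <- MASS::polr(Sat ~ Infl + Type + Cont, data = housing,
                          weights = Freq, Hess = TRUE)
summary(ordinal_fit)
```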

## Call:
## MASS::polr(formula = Sat ~ Infl + Type + Cont, data = housing, 
##     weights = Freq, Hess = T)
## 
## Coefficients:
##                 Value Std. Error t value
## InflMedium     0.5664    0.10465   5.412
## InflHigh       1.2888    0.12716  10.136
## TypeApartment -0.5724    0.11924  -4.800
## TypeAtrium    -0.3662    0.15517  -2.360
## TypeTerrace   -1.0910    0.15149  -7.202
## ContHigh       0.3603    0.09554   3.771
## 
## Intercepts:
##             Value   Std. Error t value
## Low|Medium  -0.4961  0.1248    -3.9739
## Medium|High  0.6907  0.1255     5.5049
## 
## Residual Deviance: 3479.149 
## AIC: 3495.149


Multinomial Logistic Regression
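A sketch matching the call in the output, assuming `bwt` is the recoded low-birth-weight data frame built in the MASS documentation (see `?MASS::birthwt`):

```r
# `bwt` is assumed to be the recoded birth-weight data from ?MASS::birthwt
dplyr::glimpse(bwt)

multi_fit <- nnet::multinom(low ~ ., data = bwt)
summary(multi_fit)
```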

## Rows: 189
## Columns: 9
## $ low   <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ age   <int> 19, 33, 20, 21, 18, 21, 22, 17, 29, 26, 19, 19, 22, 30, 18, 1...
## $ lwt   <int> 182, 155, 105, 108, 107, 124, 118, 103, 123, 113, 95, 150, 95...
## $ race  <fct> black, other, white, white, white, other, white, other, white...
## $ smoke <lgl> FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TR...
## $ ptd   <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
## $ ht    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
## $ ui    <lgl> TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, F...
## $ ftv   <fct> 0, 2+, 1, 2+, 0, 0, 1, 1, 1, 0, 0, 1, 0, 2+, 0, 0, 0, 2+, 0, ...
## # weights:  12 (11 variable)
## initial  value 131.004817 
## iter  10 value 98.029803
## final  value 97.737759 
## converged
## Call:
## nnet::multinom(formula = low ~ ., data = bwt)
## 
## Coefficients:
##                  Values  Std. Err.
## (Intercept)  0.82320102 1.24476766
## age         -0.03723828 0.03870437
## lwt         -0.01565359 0.00708079
## raceblack    1.19240391 0.53598076
## raceother    0.74065606 0.46176615
## smokeTRUE    0.75550487 0.42503626
## ptdTRUE      1.34375901 0.48063449
## htTRUE       1.91320116 0.72076133
## uiTRUE       0.68020207 0.46434974
## ftv1        -0.43638470 0.47941107
## ftv2+        0.17900392 0.45639129
## 
## Residual Deviance: 195.4755 
## AIC: 217.4755


Instrumental Variable Regression by Two-Stage Least Squares
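A sketch matching the call in the output, assuming the derived columns `rprice`, `rincome`, and `tdiff` visible in the glimpse below were created as in the AER documentation for `ivreg()`:

```r
library(AER)
data("CigarettesSW")

# Derived columns, as in the AER documentation
CigarettesSW <- transform(
  CigarettesSW,
  rprice  = price / cpi,                # real price
  rincome = income / population / cpi,  # real per-capita income
  tdiff   = (taxs - tax) / cpi          # real sales-tax differential (instrument)
)

# Price is instrumented by the tax differential and the real excise tax
iv_fit <- ivreg(log(packs) ~ log(rprice) + log(rincome) |
                  log(rincome) + tdiff + I(tax / cpi),
                data = CigarettesSW, subset = year == "1995")
summary(iv_fit)
```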

## Rows: 96
## Columns: 12
## $ state      <fct> AL, AR, AZ, CA, CO, CT, DE, FL, GA, IA, ID, IL, IN, KS, ...
## $ year       <fct> 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 19...
## $ cpi        <dbl> 1.076, 1.076, 1.076, 1.076, 1.076, 1.076, 1.076, 1.076, ...
## $ population <dbl> 3973000, 2327000, 3184000, 26444000, 3209000, 3201000, 6...
## $ packs      <dbl> 116.4863, 128.5346, 104.5226, 100.3630, 112.9635, 109.27...
## $ income     <dbl> 46014968, 26210736, 43956936, 447102816, 49466672, 60063...
## $ tax        <dbl> 32.50000, 37.00000, 31.00000, 26.00000, 31.00000, 42.000...
## $ price      <dbl> 102.18167, 101.47500, 108.57875, 107.83734, 94.26666, 12...
## $ taxs       <dbl> 33.34834, 37.00000, 36.17042, 32.10400, 31.00000, 51.483...
## $ rprice     <dbl> 94.96438, 94.30762, 100.90962, 100.22058, 87.60842, 118....
## $ rincome    <dbl> 10.763866, 10.468165, 12.830456, 15.713321, 14.326190, 1...
## $ tdiff      <dbl> 0.7884121, 0.0000000, 4.8052211, 5.6728627, 0.0000000, 8...
## 
## Call:
## AER::ivreg(formula = log(packs) ~ log(rprice) + log(rincome) | 
##     log(rincome) + tdiff + I(tax/cpi), data = CigarettesSW, subset = year == 
##     "1995")
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.6006931 -0.0862222 -0.0009999  0.1164699  0.3734227 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.8950     1.0586   9.348 4.12e-12 ***
## log(rprice)   -1.2774     0.2632  -4.853 1.50e-05 ***
## log(rincome)   0.2804     0.2386   1.175    0.246    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1879 on 45 degrees of freedom
## Multiple R-Squared: 0.4294,  Adjusted R-squared: 0.4041 
## Wald test: 13.28 on 2 and 45 DF,  p-value: 2.931e-05


Multilevel Mixed-Effects Models
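An example using the `sleepstudy` data from lme4; a sketch matching the formula in the output:

```r
library(lme4)
dplyr::glimpse(sleepstudy)

# Random intercept and random slope for Days, varying by Subject
mixed_fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(mixed_fit)
```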

## Rows: 180
## Columns: 3
## $ Reaction <dbl> 249.5600, 258.7047, 250.8006, 321.4398, 356.8519, 414.6901...
## $ Days     <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9...
## $ Subject  <fct> 308, 308, 308, 308, 308, 308, 308, 308, 308, 308, 309, 309...
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (Days | Subject)
##    Data: sleepstudy
## 
## REML criterion at convergence: 1743.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9536 -0.4634  0.0231  0.4634  5.1793 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  Subject  (Intercept) 612.10   24.741       
##           Days         35.07    5.922   0.07
##  Residual             654.94   25.592       
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  251.405      6.825  36.838
## Days          10.467      1.546   6.771
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.138


Regression Table Output