Linear Regression


In the model formula, use a period to include every variable in the data frame (other than the specified dependent variable) as a covariate.
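A minimal sketch of the call that produces the output below, using the built-in `attitude` data:

```r
# Regress rating on every other column of the attitude data frame
fit <- lm(rating ~ ., data = attitude)
summary(fit)
```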

## 
## Call:
## lm(formula = rating ~ ., data = attitude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9418  -4.3555   0.3158   5.5425  11.5990 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.78708   11.58926   0.931 0.361634    
## complaints   0.61319    0.16098   3.809 0.000903 ***
## privileges  -0.07305    0.13572  -0.538 0.595594    
## learning     0.32033    0.16852   1.901 0.069925 .  
## raises       0.08173    0.22148   0.369 0.715480    
## critical     0.03838    0.14700   0.261 0.796334    
## advance     -0.21706    0.17821  -1.218 0.235577    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.068 on 23 degrees of freedom
## Multiple R-squared:  0.7326, Adjusted R-squared:  0.6628 
## F-statistic:  10.5 on 6 and 23 DF,  p-value: 1.24e-05


Formal Checks of OLS Model Assumptions

The performance package, also part of the easystats project, provides several convenient functions for checking OLS model assumptions.
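The output below comes from checks along these lines (the exact print format varies by package version):

```r
library(performance)
check_collinearity(fit)        # variance inflation factors (multicollinearity)
check_normality(fit)           # are the residuals normally distributed?
check_heteroscedasticity(fit)  # is the error variance constant?
```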

## # A tibble: 6 x 3
##   Parameter    VIF SE_factor
##   <chr>      <dbl>     <dbl>
## 1 complaints  2.67      1.63
## 2 privileges  1.60      1.27
## 3 learning    2.27      1.51
## 4 raises      3.08      1.75
## 5 critical    1.23      1.11
## 6 advance     1.95      1.40
## OK: Residuals appear as normally distributed (p = 0.383).
## OK: Error variance appears to be homoscedastic (p = 0.153).


Visual Checks of Model Assumptions

The formal checks suggest that this model satisfies the OLS assumptions, but visual checks are still helpful, especially when learning OLS.
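One convenient option is `performance::check_model()`, which draws a panel of ggplot2-based diagnostics; a sketch (an assumption on my part, since the plotting code is not shown):

```r
# Panel of visual diagnostics: residuals, normality, influential points, etc.
performance::check_model(fit)
```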

(Diagnostic plots not reproduced here.)


Logistic Regression


Using sample data from the ISLR package:
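A sketch of the setup, assuming `bivariate` (the object name in the call below) is the `Default` dataset from ISLR, whose columns match the glimpse shown:

```r
library(ISLR)
bivariate <- Default  # 10,000 customer records: default, student, balance, income
dplyr::glimpse(bivariate)

# Logistic regression of default on all other variables
logit_fit <- glm(default ~ ., family = binomial(link = "logit"), data = bivariate)
summary(logit_fit)
```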

## Rows: 10,000
## Columns: 4
## $ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No,...
## $ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No...
## $ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885...
## $ income  <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491...
## 
## Call:
## glm(formula = default ~ ., family = binomial(link = "logit"), 
##     data = bivariate)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4691  -0.1418  -0.0557  -0.0203   3.7383  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
## studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
## balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
## income       3.033e-06  8.203e-06   0.370  0.71152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.5  on 9996  degrees of freedom
## AIC: 1579.5
## 
## Number of Fisher Scoring iterations: 8


The mfx package is useful if you want odds ratios or marginal effects:
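Both calls appear in the output below; a sketch:

```r
# Exponentiated coefficients (odds ratios)
mfx::logitor(default ~ ., data = bivariate)

# Marginal effects, evaluated at the means of the covariates
mfx::logitmfx(default ~ ., data = bivariate)
```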

## Call:
## mfx::logitor(formula = default ~ ., data = bivariate)
## 
## Odds Ratio:
##             OddsRatio  Std. Err.       z     P>|z|    
## studentYes 5.2373e-01 1.2373e-01 -2.7376  0.006188 ** 
## balance    1.0058e+00 2.3323e-04 24.7376 < 2.2e-16 ***
## income     1.0000e+00 8.2026e-06  0.3698  0.711520    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Call:
## mfx::logitmfx(formula = default ~ ., data = bivariate)
## 
## Marginal Effects:
##                  dF/dx   Std. Err.       z     P>|z|    
## studentYes -1.2055e-03  4.3151e-04 -2.7936  0.005212 ** 
## balance     1.1998e-05  1.8268e-06  6.5679 5.104e-11 ***
## income      6.3446e-09  1.7204e-08  0.3688  0.712287    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## dF/dx is for discrete change for the following variables:
## 
## [1] "studentYes"


Poisson/Negative Binomial Regressions


Using sample data from the AER package:
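A sketch matching the call in the output:

```r
data("RecreationDemand", package = "AER")

# Poisson regression of recreational boating trips on all other variables
pois_fit <- glm(trips ~ ., family = poisson, data = RecreationDemand)
summary(pois_fit)
```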

## 
## Call:
## glm(formula = trips ~ ., family = poisson, data = RecreationDemand)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -11.8465   -1.1411   -0.8896   -0.4780   18.6071  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.264993   0.093722   2.827  0.00469 ** 
## quality      0.471726   0.017091  27.602  < 2e-16 ***
## skiyes       0.418214   0.057190   7.313 2.62e-13 ***
## income      -0.111323   0.019588  -5.683 1.32e-08 ***
## userfeeyes   0.898165   0.078985  11.371  < 2e-16 ***
## costC       -0.003430   0.003118  -1.100  0.27131    
## costS       -0.042536   0.001670 -25.467  < 2e-16 ***
## costH        0.036134   0.002710  13.335  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4849.7  on 658  degrees of freedom
## Residual deviance: 2305.8  on 651  degrees of freedom
## AIC: 3074.9
## 
## Number of Fisher Scoring iterations: 7

When estimating Poisson models, check for overdispersion:
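The test below looks like the output of `performance::check_overdispersion()` (an assumption; other overdispersion tests print similarly):

```r
# Compare the residual variance to the mean; a ratio well above 1 signals
# overdispersion, which violates the Poisson assumption
performance::check_overdispersion(pois_fit)
```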

## # Overdispersion test
## 
##        dispersion ratio =    6.298
##   Pearson's Chi-Squared = 4100.093
##                 p-value =  < 0.001
## Overdispersion detected.

Given the overdispersion present, we should select a different distributional family; a negative binomial regression is more appropriate:
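A sketch matching the call in the output:

```r
# Negative binomial regression handles overdispersed count data
nb_fit <- MASS::glm.nb(trips ~ ., data = RecreationDemand)
summary(nb_fit)
```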

## 
## Call:
## MASS::glm.nb(formula = trips ~ ., data = RecreationDemand, init.theta = 0.7292568331, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9727  -0.6256  -0.4619  -0.2897   5.0494  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.121936   0.214303  -5.235 1.65e-07 ***
## quality      0.721999   0.040117  17.998  < 2e-16 ***
## skiyes       0.612139   0.150303   4.073 4.65e-05 ***
## income      -0.026059   0.042453  -0.614    0.539    
## userfeeyes   0.669168   0.353021   1.896    0.058 .  
## costC        0.048009   0.009185   5.227 1.72e-07 ***
## costS       -0.092691   0.006653 -13.931  < 2e-16 ***
## costH        0.038836   0.007751   5.011 5.42e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.7293) family taken to be 1)
## 
##     Null deviance: 1244.61  on 658  degrees of freedom
## Residual deviance:  425.42  on 651  degrees of freedom
## AIC: 1669.1
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.7293 
##           Std. Err.:  0.0747 
## 
##  2 x log-likelihood:  -1651.1150


Fixed-Effects Panel Regressions

An example of a panel regression with year and individual fixed effects:
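A sketch matching the call in the output, using the Grunfeld investment data bundled with plm:

```r
data("Grunfeld", package = "plm")

# "Within" estimator with both individual (firm) and time (year) fixed effects
panel_fit <- plm::plm(inv ~ value + capital, data = Grunfeld,
                      effect = "twoway", model = "within")
summary(panel_fit)
```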

## Twoways effects Within Model
## 
## Call:
## plm::plm(formula = inv ~ value + capital, data = Grunfeld, effect = "twoway", 
##     model = "within")
## 
## Balanced Panel: n = 10, T = 20, N = 200
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -162.6094  -19.4710   -1.2669   19.1277  211.8420 
## 
## Coefficients:
##         Estimate Std. Error t-value  Pr(>|t|)    
## value   0.117716   0.013751  8.5604 6.653e-15 ***
## capital 0.357916   0.022719 15.7540 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    1615600
## Residual Sum of Squares: 452150
## R-Squared:      0.72015
## Adj. R-Squared: 0.67047
## F-statistic: 217.442 on 2 and 169 DF, p-value: < 2.22e-16


Cox Proportional Hazards Model
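An example using the `lung` cancer data from the survival package; a sketch matching the call in the output:

```r
library(survival)

# Hazard of death as a function of age, sex, performance score, calories, and weight loss
cox_fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog + meal.cal + wt.loss,
                 data = lung)
summary(cox_fit)
```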

## Call:
## coxph(formula = Surv(time, status) ~ age + sex + ph.ecog + meal.cal + 
##     wt.loss, data = lung)
## 
##   n= 170, number of events= 123 
##    (58 observations deleted due to missingness)
## 
##                coef  exp(coef)   se(coef)      z Pr(>|z|)    
## age       8.693e-03  1.009e+00  1.124e-02  0.773 0.439303    
## sex      -5.400e-01  5.827e-01  1.988e-01 -2.716 0.006602 ** 
## ph.ecog   5.152e-01  1.674e+00  1.483e-01  3.474 0.000513 ***
## meal.cal -4.499e-05  1.000e+00  2.512e-04 -0.179 0.857865    
## wt.loss  -1.117e-02  9.889e-01  7.548e-03 -1.480 0.138841    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##          exp(coef) exp(-coef) lower .95 upper .95
## age         1.0087     0.9913    0.9868    1.0312
## sex         0.5827     1.7160    0.3947    0.8604
## ph.ecog     1.6739     0.5974    1.2517    2.2384
## meal.cal    1.0000     1.0000    0.9995    1.0004
## wt.loss     0.9889     1.0112    0.9744    1.0036
## 
## Concordance= 0.64  (se = 0.03 )
## Likelihood ratio test= 22.34  on 5 df,   p=5e-04
## Wald test            = 21.31  on 5 df,   p=7e-04
## Score (logrank) test = 21.86  on 5 df,   p=6e-04


Multiple Imputation

Refer to the Amelia documentation before using!

I suggest instructing Amelia to: (1) use multiple CPU cores, (2) apply transformations to certain variables (e.g., logs for skewed variables, square roots for counts, and a logistic transformation for proportions), and (3) create multiple imputed datasets. You can then use the mean or median value across all imputations.
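A minimal sketch of such a call; the data frame `df` and its column names are hypothetical, and the transformation arguments should be matched to your own variables:

```r
library(Amelia)

# `df` is a hypothetical data frame with a skewed variable, a count, and a proportion
imputed <- amelia(
  df,
  m        = 5,            # create five imputed datasets
  logs     = "income",     # log-transform a skewed variable (hypothetical column)
  sqrts    = "visits",     # square root for a count (hypothetical column)
  lgstc    = "share",      # logistic transform for a proportion (hypothetical column)
  parallel = "multicore",  # use multiple CPU cores
  ncpus    = 4
)
```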


Robust Standard Errors

For linear estimators, using the estimatr package is rather straightforward:
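A sketch matching the call in the output (HC2 is `lm_robust()`'s default standard error type):

```r
# OLS with heteroscedasticity-robust (HC2) standard errors
robust_fit <- estimatr::lm_robust(rating ~ ., data = attitude)
summary(robust_fit)
```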

## 
## Call:
## estimatr::lm_robust(formula = rating ~ ., data = attitude)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)   CI Lower CI Upper DF
## (Intercept) 10.78708    14.4600  0.7460 4.632e-01 -19.125630  40.6998 23
## complaints   0.61319     0.1246  4.9210 5.675e-05   0.355418   0.8710 23
## privileges  -0.07305     0.1253 -0.5830 5.656e-01  -0.332249   0.1861 23
## learning     0.32033     0.1577  2.0318 5.388e-02  -0.005807   0.6465 23
## raises       0.08173     0.1732  0.4718 6.415e-01  -0.276599   0.4401 23
## critical     0.03838     0.1781  0.2156 8.312e-01  -0.329964   0.4067 23
## advance     -0.21706     0.1650 -1.3151 2.014e-01  -0.558482   0.1244 23
## 
## Multiple R-squared:  0.7326 ,    Adjusted R-squared:  0.6628 
## F-statistic: 11.24 on 6 and 23 DF,  p-value: 7.187e-06

For other models, you will likely have to use the lmtest and sandwich packages.
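For example, a sketch of robust standard errors for the earlier logistic regression (assuming the `logit_fit` object from above):

```r
# Coefficient tests with a heteroscedasticity-consistent covariance matrix
lmtest::coeftest(logit_fit, vcov = sandwich::vcovHC(logit_fit, type = "HC1"))
```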


Ordinal Logit Regression

An example from the MASS package’s documentation.
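A sketch matching the call in the output, using the Copenhagen `housing` data from MASS:

```r
# Proportional odds logistic regression of housing satisfaction
ordinal_fit <- MASS::polr(Sat ~ Infl + Type + Cont, data = housing,
                          weights = Freq, Hess = TRUE)
summary(ordinal_fit)
```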

## Call:
## MASS::polr(formula = Sat ~ Infl + Type + Cont, data = housing, 
##     weights = Freq, Hess = T)
## 
## Coefficients:
##                 Value Std. Error t value
## InflMedium     0.5664    0.10465   5.412
## InflHigh       1.2888    0.12716  10.136
## TypeApartment -0.5724    0.11924  -4.800
## TypeAtrium    -0.3662    0.15517  -2.360
## TypeTerrace   -1.0910    0.15149  -7.202
## ContHigh       0.3603    0.09554   3.771
## 
## Intercepts:
##             Value   Std. Error t value
## Low|Medium  -0.4961  0.1248    -3.9739
## Medium|High  0.6907  0.1255     5.5049
## 
## Residual Deviance: 3479.149 
## AIC: 3495.149


Multinomial Logistic Regression
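A sketch matching the call in the output, assuming `bwt` is the recoded low-birth-weight data frame built in the MASS documentation (see `?MASS::birthwt`):

```r
# `bwt` is assumed to be the recoded birth-weight data from ?MASS::birthwt
dplyr::glimpse(bwt)

multi_fit <- nnet::multinom(low ~ ., data = bwt)
summary(multi_fit)
```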

## Rows: 189
## Columns: 9
## $ low   <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ age   <int> 19, 33, 20, 21, 18, 21, 22, 17, 29, 26, 19, 19, 22, 30, 18, 1...
## $ lwt   <int> 182, 155, 105, 108, 107, 124, 118, 103, 123, 113, 95, 150, 95...
## $ race  <fct> black, other, white, white, white, other, white, other, white...
## $ smoke <lgl> FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TR...
## $ ptd   <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
## $ ht    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
## $ ui    <lgl> TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, F...
## $ ftv   <fct> 0, 2+, 1, 2+, 0, 0, 1, 1, 1, 0, 0, 1, 0, 2+, 0, 0, 0, 2+, 0, ...
## # weights:  12 (11 variable)
## initial  value 131.004817 
## iter  10 value 98.029803
## final  value 97.737759 
## converged
## Call:
## nnet::multinom(formula = low ~ ., data = bwt)
## 
## Coefficients:
##                  Values  Std. Err.
## (Intercept)  0.82320102 1.24476766
## age         -0.03723828 0.03870437
## lwt         -0.01565359 0.00708079
## raceblack    1.19240391 0.53598076
## raceother    0.74065606 0.46176615
## smokeTRUE    0.75550487 0.42503626
## ptdTRUE      1.34375901 0.48063449
## htTRUE       1.91320116 0.72076133
## uiTRUE       0.68020207 0.46434974
## ftv1        -0.43638470 0.47941107
## ftv2+        0.17900392 0.45639129
## 
## Residual Deviance: 195.4755 
## AIC: 217.4755


Instrumental Variable Regression by Two-Stage Least Squares
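A sketch matching the call in the output, assuming the derived columns `rprice`, `rincome`, and `tdiff` visible in the glimpse below were created as in the AER documentation for `ivreg()`:

```r
library(AER)
data("CigarettesSW")

# Derived columns, as in the AER documentation
CigarettesSW <- transform(
  CigarettesSW,
  rprice  = price / cpi,                # real price
  rincome = income / population / cpi,  # real per-capita income
  tdiff   = (taxs - tax) / cpi          # real sales-tax differential (instrument)
)

# Price is instrumented by the tax differential and the real excise tax
iv_fit <- ivreg(log(packs) ~ log(rprice) + log(rincome) |
                  log(rincome) + tdiff + I(tax / cpi),
                data = CigarettesSW, subset = year == "1995")
summary(iv_fit)
```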

## Rows: 96
## Columns: 12
## $ state      <fct> AL, AR, AZ, CA, CO, CT, DE, FL, GA, IA, ID, IL, IN, KS, ...
## $ year       <fct> 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 19...
## $ cpi        <dbl> 1.076, 1.076, 1.076, 1.076, 1.076, 1.076, 1.076, 1.076, ...
## $ population <dbl> 3973000, 2327000, 3184000, 26444000, 3209000, 3201000, 6...
## $ packs      <dbl> 116.4863, 128.5346, 104.5226, 100.3630, 112.9635, 109.27...
## $ income     <dbl> 46014968, 26210736, 43956936, 447102816, 49466672, 60063...
## $ tax        <dbl> 32.50000, 37.00000, 31.00000, 26.00000, 31.00000, 42.000...
## $ price      <dbl> 102.18167, 101.47500, 108.57875, 107.83734, 94.26666, 12...
## $ taxs       <dbl> 33.34834, 37.00000, 36.17042, 32.10400, 31.00000, 51.483...
## $ rprice     <dbl> 94.96438, 94.30762, 100.90962, 100.22058, 87.60842, 118....
## $ rincome    <dbl> 10.763866, 10.468165, 12.830456, 15.713321, 14.326190, 1...
## $ tdiff      <dbl> 0.7884121, 0.0000000, 4.8052211, 5.6728627, 0.0000000, 8...
## 
## Call:
## AER::ivreg(formula = log(packs) ~ log(rprice) + log(rincome) | 
##     log(rincome) + tdiff + I(tax/cpi), data = CigarettesSW, subset = year == 
##     "1995")
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.6006931 -0.0862222 -0.0009999  0.1164699  0.3734227 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.8950     1.0586   9.348 4.12e-12 ***
## log(rprice)   -1.2774     0.2632  -4.853 1.50e-05 ***
## log(rincome)   0.2804     0.2386   1.175    0.246    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1879 on 45 degrees of freedom
## Multiple R-Squared: 0.4294,  Adjusted R-squared: 0.4041 
## Wald test: 13.28 on 2 and 45 DF,  p-value: 2.931e-05


Multilevel Mixed-Effects Models
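An example using the `sleepstudy` data from lme4; a sketch matching the formula in the output:

```r
library(lme4)
dplyr::glimpse(sleepstudy)

# Random intercept and random slope for Days, varying by Subject
mixed_fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(mixed_fit)
```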

## Rows: 180
## Columns: 3
## $ Reaction <dbl> 249.5600, 258.7047, 250.8006, 321.4398, 356.8519, 414.6901...
## $ Days     <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9...
## $ Subject  <fct> 308, 308, 308, 308, 308, 308, 308, 308, 308, 308, 309, 309...
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (Days | Subject)
##    Data: sleepstudy
## 
## REML criterion at convergence: 1743.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9536 -0.4634  0.0231  0.4634  5.1793 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  Subject  (Intercept) 612.10   24.741       
##           Days         35.07    5.922   0.07
##  Residual             654.94   25.592       
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  251.405      6.825  36.838
## Days          10.467      1.546   6.771
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.138


Regression Table Output