Contents

Contents

This workshop covers steps for conducting statistical analyses in R and preparing tables and figures.

  • Presenting descriptive statistics & bivariate relationships
  • Linear Regression
  • Models for Categorical Outcome
  • Models for Counts
  • Coming soon…
    • Mixed Effects
    • Survival Analysis
    • requests?

It is assumed that you are familiar with R (e.g., you know about data types, data structures, and how to clean and create variables). If this is not the case, you might want to check out the Intro to R Workshop

Useful resources:

Setup

We need a few R packages which can be installed with the following command:

install.packages(c("haven", "stargazer", "margins", "prediction", "dotwhisker")

and the packages can be loaded with:

library(haven)
library(stargazer)
library(prediction)
library(margins)
library(dotwhisker)
library(MASS)
library(nnet)

Table of Descriptives

Table of descriptive statistics

Row

Stata


. sysuse auto, cl(1978 Automobile Data)

. save auto.dta, replace
file auto.dta saved

. estpost summarize price mpg rep78 foreign, listwise

             |  e(count)   e(sum_w)    e(mean)     e(Var)      e(sd) 
-------------+-------------------------------------------------------
       price |        69         69   6146.043    8482308    2912.44 
         mpg |        69         69   21.28986   34.41475   5.866408 
       rep78 |        69         69   3.405797   .9799659   .9899323 
     foreign |        69         69   .3043478   .2148338   .4635016 

             |    e(min)     e(max)     e(sum) 
-------------+---------------------------------
       price |      3291      15906     424077 
         mpg |        12         41       1469 
       rep78 |         1          5        235 
     foreign |         0          1         21 

. esttab, cells("mean sd min max")

----------------------------------------------------------------
                      (1)                                       
                                                                
                     mean           sd          min          max
----------------------------------------------------------------
price            6146.043      2912.44         3291        15906
mpg              21.28986     5.866408           12           41
rep78            3.405797     .9899323            1            5
foreign          .3043478     .4635016            0            1
----------------------------------------------------------------
N                      69                                       
----------------------------------------------------------------

. 

R

# load and summarize the data
autoData <- read_dta('auto.dta')
N <- nrow(autoData)
## summary(autoData)
stargazer(data.frame(autoData), type='html', out = "descriptives.html",
          summary.stat = c("mean", "sd", "min", "max"),
          notes = paste("Number of observations =", N))
Statistic Mean St. Dev. Min Max
price 6,165.257 2,949.496 3,291 15,906
mpg 21.297 5.786 12 41
rep78 3.406 0.990 1.000 5.000
headroom 2.993 0.846 1.500 5.000
trunk 13.757 4.277 5 23
weight 3,019.459 777.194 1,760 4,840
length 187.932 22.266 142 233
turn 39.649 4.399 31 51
displacement 197.297 91.837 79 425
gear_ratio 3.015 0.456 2.190 3.890
foreign 0.297 0.460 0 1
Number of observations = 74
# open the file descriptives.html, then copy and paste to excel

Linear Models

Introduction

Here we do the following with the prestige data

  • Contents
    • linear regression with continuous and categorical predictors
    • creating tables with results from multiple models (with interactions and subsetting)
    • margins plots
  • Data: prestige
    • unit of analysis: occupations
    • prestige: % of respondents who rated the occupation as “good” or “excellent”
    • educ: % of workers in the occupation who had a HS degree
    • income: % of workers in the occupation who earned more than $3,500
    • occ_type: indicator for type of occupation, with values blue collar (bc), professional and managerial (prof), and white collar (wc)

Fit linear model

Fit linear regression model

Row

Stata


. * load and save the d. use https://stats.idre.ucla.edu/stat/stata/examples/ara/duncan, clear
(From Fox, Applied Regression Analysis.  Use 'notes' command for source of data
> )

. save duncan.dta, replace
file duncan.dta saved

. 
. * summarize the data and fit a linear model

. ds
occtitle  income    educ      prestige  occ_type

. regress prestige income educ i.occ_type

      Source |       SS           df       MS      Number of obs   =        45
-------------+----------------------------------   F(4, 40)        =    105.03
       Model |  39889.6897         4  9972.42242   Prob > F        =    0.0000
    Residual |  3797.95474        40  94.9488686   R-squared       =    0.9131
-------------+----------------------------------   Adj R-squared   =    0.9044
       Total |  43687.6444        44   992.90101   Root MSE        =    9.7442

------------------------------------------------------------------------------
    prestige |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .5975465   .0893553     6.69   0.000     .4169527    .7781403
        educ |   .3453193   .1136089     3.04   0.004     .1157071    .5749315
             |
    occ_type |
       prof  |   16.65751   6.993007     2.38   0.022      2.52412    30.79091
         wc  |  -14.66113   6.108774    -2.40   0.021    -27.00743   -2.314841
             |
       _cons |  -.1850278   3.713771    -0.05   0.961    -7.690838    7.320782
------------------------------------------------------------------------------

. 

R

# load and summarize the data
duncan <- read_dta('duncan.dta')
summary(duncan)
   occtitle             income           educ           prestige    
 Length:45          Min.   : 7.00   Min.   :  7.00   Min.   : 3.00  
 Class :character   1st Qu.:21.00   1st Qu.: 26.00   1st Qu.:16.00  
 Mode  :character   Median :42.00   Median : 45.00   Median :41.00  
                    Mean   :41.87   Mean   : 52.56   Mean   :47.69  
                    3rd Qu.:64.00   3rd Qu.: 84.00   3rd Qu.:81.00  
                    Max.   :81.00   Max.   :100.00   Max.   :97.00  
    occ_type    
 Min.   :1.000  
 1st Qu.:1.000  
 Median :2.000  
 Mean   :1.667  
 3rd Qu.:2.000  
 Max.   :3.000  
str(duncan$occ_type)
 dbl+lbl [1:45] 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2,...
 @ label       : chr "Occupation type, Professional/Manag, White Collar, Blue Collar"
 @ format.stata: chr "%8.0g"
 @ labels      : Named num [1:3] 1 2 3
  ..- attr(*, "names")= chr [1:3] "bc" "prof" "wc"
is.labelled(duncan$occ_type)
[1] TRUE
table(as_factor(duncan$occ_type), duncan$occ_type)
      
        1  2  3
  bc   21  0  0
  prof  0 18  0
  wc    0  0  6
# fit the model and print the results
model1 <- lm(prestige ~ income + educ + factor(occ_type), data = duncan)
model1

Call:
lm(formula = prestige ~ income + educ + factor(occ_type), data = duncan)

Coefficients:
      (Intercept)             income               educ  factor(occ_type)2  
          -0.1850             0.5975             0.3453            16.6575  
factor(occ_type)3  
         -14.6611  
# print even more results
summary(model1)

Call:
lm(formula = prestige ~ income + educ + factor(occ_type), data = duncan)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.890  -5.740  -1.754   5.442  28.972 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -0.18503    3.71377  -0.050  0.96051    
income              0.59755    0.08936   6.687 5.12e-08 ***
educ                0.34532    0.11361   3.040  0.00416 ** 
factor(occ_type)2  16.65751    6.99301   2.382  0.02206 *  
factor(occ_type)3 -14.66113    6.10877  -2.400  0.02114 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.744 on 40 degrees of freedom
Multiple R-squared:  0.9131,    Adjusted R-squared:  0.9044 
F-statistic:   105 on 4 and 40 DF,  p-value: < 2.2e-16

Accessing results

Accessing results

Row

Stata


. use duncan, cl(From Fox, Applied Regression Analysis.  Use 'notes' command for source of data
> )

. quietly: regress prestige income educ i.occ_type

. 
. * display results

. ereturn list

scalars:
                  e(N) =  45
               e(df_m) =  4
               e(df_r) =  40
                  e(F) =  105.0293970832132
                 e(r2) =  .9130657009984507
               e(rmse) =  9.744171007333955
                e(mss) =  39889.68969963773
                e(rss) =  3797.954744806706
               e(r2_a) =  .9043722710982958
                 e(ll) =  -163.6522324291697
               e(ll_0) =  -218.6107915932074
               e(rank) =  5

macros:
            e(cmdline) : "regress prestige income educ i.occ_type"
              e(title) : "Linear regression"
          e(marginsok) : "XB default"
                e(vce) : "ols"
             e(depvar) : "prestige"
                e(cmd) : "regress"
         e(properties) : "b V"
            e(predict) : "regres_p"
              e(model) : "ols"
          e(estat_cmd) : "regress_estat"

matrices:
                  e(b) :  1 x 6
                  e(V) :  6 x 6

functions:
             e(sample)   

. display e(r2)
.9130657

. matrix list e(b)

e(b)[1,6]
                                    1b.          2.          3.            
        income        educ    occ_type    occ_type    occ_type       _cons
y1    .5975465   .34531933           0   16.657513  -14.661133  -.18502783

. 

R

names(model1)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "contrasts"     "xlevels"       "call"          "terms"        
[13] "model"        
names(summary(model1))
 [1] "call"          "terms"         "residuals"     "coefficients" 
 [5] "aliased"       "sigma"         "df"            "r.squared"    
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled" 
summary(model1)$r.squared
[1] 0.9130657
model1$coefficients
      (Intercept)            income              educ factor(occ_type)2 
       -0.1850278         0.5975465         0.3453193        16.6575134 
factor(occ_type)3 
      -14.6611334 

Multiple models

Multiple models

Row

Stata


. use duncan, cl(From Fox, Applied Regression Analysis.  Use 'notes' command for source of data
> )

. quietly: regress prestige income

. eststo mod1

. quietly: regress prestige income educ

. eststo mod2

. quietly: regress prestige income educ i.occ_type

. eststo mod3

. quietly: regress prestige income educ i.occ_type c.income#c.educ

. eststo mod4

. estout mod1 mod2 mod3 mod4, cells(b(fmt(4) star) se(par)) stats(r2 N)

----------------------------------------------------------------------------
                     mod1            mod2            mod3            mod4   
                     b/se            b/se            b/se            b/se   
----------------------------------------------------------------------------
income             1.0804***       0.5987***       0.5975***       0.8520***
                 (0.1074)        (0.1197)        (0.0894)        (0.1494)   
educ                               0.5458***       0.3453**        0.6400***
                                 (0.0983)        (0.1136)        (0.1788)   
1.occ_type                                         0.0000          0.0000   
                                                      (.)             (.)   
2.occ_type                                        16.6575*        13.6764   
                                                 (6.9930)        (6.8700)   
3.occ_type                                       -14.6611*       -19.4339** 
                                                 (6.1088)        (6.3016)   
c.income#c~c                                                      -0.0053*  
                                                                 (0.0025)   
_cons              2.4566         -6.0647         -0.1850        -10.2146   
                 (5.1901)        (4.2719)        (3.7138)        (5.9971)   
----------------------------------------------------------------------------
r2                 0.7019          0.8282          0.9131          0.9218   
N                 45.0000         45.0000         45.0000         45.0000   
----------------------------------------------------------------------------

. * syntax for writing results: using stataModels.csv

. 

R

model1 <- lm(prestige ~ income, data = duncan)
model2 <- lm(prestige ~ income + educ, data = duncan)
model3 <- update(model2, . ~ . + as_factor(occ_type), data = duncan)
model4 <- lm(prestige ~ income*educ + as_factor(occ_type), data = duncan)
cov_labels <- c("Income", "Education", "Professional", "White Collar", 
                "Income * Education", "Intercept")
stargazer(model1, model2, model3, model4, type = "html", out = "models.html",
          model.numbers = FALSE, dep.var.caption = "", dep.var.labels.include = FALSE,
          column.labels = c("Model 1", "Model 2", "Model 3", "Model 4"),
          star.cutoffs = c(0.05, 0.01, 0.001),
          table.layout = "-c-!t-s-n",
          covariate.labels = cov_labels, digits = 4, keep.stat = c("rsq", "n"))
Model 1 Model 2 Model 3 Model 4
Income 1.0804*** 0.5987*** 0.5975*** 0.8520***
(0.1074) (0.1197) (0.0894) (0.1494)
Education 0.5458*** 0.3453** 0.6400***
(0.0983) (0.1136) (0.1788)
Professional 16.6575* 13.6764
(6.9930) (6.8700)
White Collar -14.6611* -19.4339**
(6.1088) (6.3016)
Income * Education -0.0053*
(0.0025)
Intercept 2.4566 -6.0647 -0.1850 -10.2146
(5.1901) (4.2719) (3.7138) (5.9971)
Observations 45 45 45 45
R2 0.7019 0.8282 0.9131 0.9218
Note: p<0.05; p<0.01; p<0.001
## open the file models.html, then copy and paste into MS Excel
## check out option: style (journal-specific format)

Subsetting

Subsetting

Row

Stata


. use duncan, cl(From Fox, Applied Regression Analysis.  Use 'notes' command for source of data
> )

. codebook occ_type

-------------------------------------------------------------------------------
occ_type         Occupation type, Professional/Manag, White Collar, Blue Collar
-------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  occ_type

                 range:  [1,3]                        units:  1
         unique values:  3                        missing .:  0/45

            tabulation:  Freq.   Numeric  Label
                            21         1  bc
                            18         2  prof
                             6         3  wc

. regress prestige income educ if occ_type == 1

      Source |       SS           df       MS      Number of obs   =        21
-------------+----------------------------------   F(2, 18)        =     25.26
       Model |  4806.89123         2  2403.44562   Prob > F        =    0.0000
    Residual |  1712.91829        18  95.1621272   R-squared       =    0.7373
-------------+----------------------------------   Adj R-squared   =    0.7081
       Total |  6519.80952        20  325.990476   Root MSE        =    9.7551

------------------------------------------------------------------------------
    prestige |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .7834109   .1321951     5.93   0.000     .5056792    1.061143
        educ |   .3196229   .2829142     1.13   0.273    -.2747578    .9140035
       _cons |  -3.950543   6.869833    -0.58   0.572    -18.38353    10.48244
------------------------------------------------------------------------------

. regress prestige income educ if occ_type == 2

      Source |       SS           df       MS      Number of obs   =        18
-------------+----------------------------------   F(2, 15)        =      9.09
       Model |  1852.96055         2  926.480274   Prob > F        =    0.0026
    Residual |   1529.4839        15  101.965593   R-squared       =    0.5478
-------------+----------------------------------   Adj R-squared   =    0.4875
       Total |  3382.44444        17   198.96732   Root MSE        =    10.098

------------------------------------------------------------------------------
    prestige |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .4142683   .1637481     2.53   0.023     .0652476    .7632891
        educ |   .3382139   .1589952     2.13   0.050    -.0006764    .6771043
       _cons |   28.05726   12.94298     2.17   0.047     .4699643    55.64456
------------------------------------------------------------------------------

. regress prestige income educ if occ_type == 3

      Source |       SS           df       MS      Number of obs   =         6
-------------+----------------------------------   F(2, 3)         =      8.14
       Model |  587.090326         2  293.545163   Prob > F        =    0.0614
    Residual |  108.243008         3  36.0810025   R-squared       =    0.8443
-------------+----------------------------------   Adj R-squared   =    0.7405
       Total |  695.333333         5  139.066667   Root MSE        =    6.0067

------------------------------------------------------------------------------
    prestige |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .4231025   .1396208     3.03   0.056    -.0212332    .8674382
        educ |   .4263938   .1431742     2.98   0.059    -.0292504     .882038
       _cons |  -10.99375   12.13372    -0.91   0.432    -49.60865    27.62116
------------------------------------------------------------------------------

. 

R

mod1_bc <- lm(prestige ~ income + educ, data = duncan, 
              subset = occ_type == 1)
mod1_prof <- lm(prestige ~ income + educ, data = duncan, 
                subset = occ_type == 2)
mod1_wc <- lm(prestige ~ income + educ, data = duncan, 
              subset = occ_type == 3)
summary(mod1_bc)

Call:
lm(formula = prestige ~ income + educ, data = duncan, subset = occ_type == 
    1)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.2629  -6.3580   0.0742   5.1065  22.5198 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.9505     6.8698  -0.575    0.572    
income        0.7834     0.1322   5.926 1.31e-05 ***
educ          0.3196     0.2829   1.130    0.273    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.755 on 18 degrees of freedom
Multiple R-squared:  0.7373,    Adjusted R-squared:  0.7081 
F-statistic: 25.26 on 2 and 18 DF,  p-value: 5.964e-06
summary(mod1_prof)

Call:
lm(formula = prestige ~ income + educ, data = duncan, subset = occ_type == 
    2)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.338  -5.216  -0.416   5.920  21.833 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  28.0573    12.9430   2.168   0.0467 *
income        0.4143     0.1637   2.530   0.0231 *
educ          0.3382     0.1590   2.127   0.0504 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.1 on 15 degrees of freedom
Multiple R-squared:  0.5478,    Adjusted R-squared:  0.4875 
F-statistic: 9.086 on 2 and 15 DF,  p-value: 0.002599
summary(mod1_wc)

Call:
lm(formula = prestige ~ income + educ, data = duncan, subset = occ_type == 
    3)

Residuals:
     9     16     21     22     23     24 
-2.450  2.341  7.023  1.233 -1.551 -6.596 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -10.9937    12.1337  -0.906   0.4317  
income        0.4231     0.1396   3.030   0.0563 .
educ          0.4264     0.1432   2.978   0.0587 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.007 on 3 degrees of freedom
Multiple R-squared:  0.8443,    Adjusted R-squared:  0.7405 
F-statistic: 8.136 on 2 and 3 DF,  p-value: 0.06142

Stata (another example)


. use duncan, cl(From Fox, Applied Regression Analysis.  Use 'notes' command for source of data
> )

. regress prestige income educ if occ_type == 1 | occ_type == 3

      Source |       SS           df       MS      Number of obs   =        27
-------------+----------------------------------   F(2, 24)        =     24.98
       Model |  5483.40426         2  2741.70213   Prob > F        =    0.0000
    Residual |  2634.00315        24  109.750131   R-squared       =    0.6755
-------------+----------------------------------   Adj R-squared   =    0.6485
       Total |  8117.40741        26  312.207977   Root MSE        =    10.476

------------------------------------------------------------------------------
    prestige |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .6289111   .1140262     5.52   0.000     .3935726    .8642496
        educ |     .10142   .1283866     0.79   0.437     -.163557     .366397
       _cons |   3.763145   4.288978     0.88   0.389    -5.088871    12.61516
------------------------------------------------------------------------------

. 

R (another example)

duncan$occBCWC <- duncan$occ_type==1 | duncan$occ_type==3
table(duncan$occBCWC, duncan$occ_type)
       
         1  2  3
  FALSE  0 18  0
  TRUE  21  0  6
mod1_bcwc <- lm(prestige ~ income + educ, data = duncan, 
              subset = occBCWC)
summary(mod1_bcwc)

Call:
lm(formula = prestige ~ income + educ, data = duncan, subset = occBCWC)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.009  -6.064  -2.194   6.147  27.351 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.7631     4.2890   0.877    0.389    
income        0.6289     0.1140   5.515 1.13e-05 ***
educ          0.1014     0.1284   0.790    0.437    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.48 on 24 degrees of freedom
Multiple R-squared:  0.6755,    Adjusted R-squared:  0.6485 
F-statistic: 24.98 on 2 and 24 DF,  p-value: 1.363e-06

Margins plots

Margins plots

Row

Stata (y-hat)


. use duncan, cl(From Fox, Applied Regression Analysis.  Use 'notes' command for source of data
> )

. regress prestige occ_type##c.educ

      Source |       SS           df       MS      Number of obs   =        45
-------------+----------------------------------   F(5, 39)        =     36.59
       Model |    36010.99         5    7202.198   Prob > F        =    0.0000
    Residual |  7676.65446        39  196.837294   R-squared       =    0.8243
-------------+----------------------------------   Adj R-squared   =    0.8018
       Total |  43687.6444        44   992.90101   Root MSE        =     14.03

------------------------------------------------------------------------------
    prestige |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    occ_type |
       prof  |   42.66032   19.47572     2.19   0.035      3.26697    82.05367
         wc  |   16.21902   23.41483     0.69   0.493    -31.14193    63.57998
             |
        educ |   1.011168   .3706646     2.73   0.010     .2614281    1.760908
             |
    occ_type#|
      c.educ |
       prof  |  -.5115146   .4222935    -1.21   0.233    -1.365684    .3426546
         wc  |  -.6322738   .4978774    -1.27   0.212    -1.639326    .3747783
             |
       _cons |  -2.854351   9.876664    -0.29   0.774    -22.83179    17.12309
------------------------------------------------------------------------------

. margins occ_type

Predictive margins                              Number of obs     =         45
Model VCE    : OLS

Expression   : Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    occ_type |
         bc  |   50.28814   10.54456     4.77   0.000     28.95977    71.61652
       prof  |   66.06553   6.696261     9.87   0.000     52.52106       79.61
         wc  |   33.27767    6.45335     5.16   0.000     20.22454     46.3308
------------------------------------------------------------------------------

. marginsplot

  Variables that uniquely identify margins: occ_type

. graph export "marginsPlot.svg", replace
(file marginsPlot.svg written in SVG format)

. 



R (y-hat)

duncan$occ_factor <- factor(duncan$occ_type, levels = 1:3, labels = c("bc", "prof", "wc"))
m1 <- lm(prestige ~ educ*occ_factor, data = duncan)
summary(m1)

Call:
lm(formula = prestige ~ educ * occ_factor, data = duncan)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.777  -9.369   0.229   6.726  41.542 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)   
(Intercept)          -2.8544     9.8767  -0.289   0.7741   
educ                  1.0112     0.3707   2.728   0.0095 **
occ_factorprof       42.6603    19.4757   2.190   0.0345 * 
occ_factorwc         16.2190    23.4148   0.693   0.4926   
educ:occ_factorprof  -0.5115     0.4223  -1.211   0.2331   
educ:occ_factorwc    -0.6323     0.4979  -1.270   0.2116   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.03 on 39 degrees of freedom
Multiple R-squared:  0.8243,    Adjusted R-squared:  0.8018 
F-statistic: 36.59 on 5 and 39 DF,  p-value: 1.015e-13
p1 <- prediction(m1, at = list(occ_factor = c("bc", "prof", "wc")))
summary(p1)
 at(occ_factor) Prediction     SE     z         p lower upper
             bc      50.29 10.545 4.769 1.850e-06 29.62 70.96
           prof      66.07  6.696 9.866 5.843e-23 52.94 79.19
             wc      33.28  6.453 5.157 2.514e-07 20.63 45.93
yhat1 <- summary(p1)[, 1:5]
names(yhat1) <- c('term', 'estimate', 'std.error', 'statistic', 'p.value')
yhat1$term <- c('Blue Collar', 'Professional', 'White Collar')
dwplot(yhat1)

# Stata uses a t-distribution: qt(0.975, df = n - df)
cbind(yhat1[, 2] - qt(0.975, df = 39) * yhat1[, 3],
      yhat1[, 2] + qt(0.975, df = 39) * yhat1[, 3])
         [,1]     [,2]
[1,] 28.95977 71.61652
[2,] 52.52106 79.61000
[3,] 20.22454 46.33080
# we could code this up ourselves with the CIs based on t-dist...

Stata (marginal effects)


. use duncan, cl(From Fox, Applied Regression Analysis.  Use 'notes' command for source of data
> )

. regress prestige occ_type##c.educ

      Source |       SS           df       MS      Number of obs   =        45
-------------+----------------------------------   F(5, 39)        =     36.59
       Model |    36010.99         5    7202.198   Prob > F        =    0.0000
    Residual |  7676.65446        39  196.837294   R-squared       =    0.8243
-------------+----------------------------------   Adj R-squared   =    0.8018
       Total |  43687.6444        44   992.90101   Root MSE        =     14.03

------------------------------------------------------------------------------
    prestige |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    occ_type |
       prof  |   42.66032   19.47572     2.19   0.035      3.26697    82.05367
         wc  |   16.21902   23.41483     0.69   0.493    -31.14193    63.57998
             |
        educ |   1.011168   .3706646     2.73   0.010     .2614281    1.760908
             |
    occ_type#|
      c.educ |
       prof  |  -.5115146   .4222935    -1.21   0.233    -1.365684    .3426546
         wc  |  -.6322738   .4978774    -1.27   0.212    -1.639326    .3747783
             |
       _cons |  -2.854351   9.876664    -0.29   0.774    -22.83179    17.12309
------------------------------------------------------------------------------

. margins, dydx(*)

Average marginal effects                        Number of obs     =         45
Model VCE    : OLS

Expression   : Linear prediction, predict()
dy/dx w.r.t. : 2.occ_type 3.occ_type educ

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    occ_type |
       prof  |   15.77739    12.4911     1.26   0.214    -9.488246    41.04302
         wc  |  -17.01048   12.36258    -1.38   0.177    -42.01615    7.995202
             |
        educ |    .722259   .1960499     3.68   0.001     .3257106    1.118807
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

. marginsplot

  Variables that uniquely identify margins: _deriv

. graph export "marginsPlotEffects.svg", replace
(file marginsPlotEffects.svg written in SVG format)

. 



R (marginal effects)

m1 <- lm(prestige ~ educ*factor(occ_type), data = duncan)
marg1 <- margins(m1)
summary(marg1)
    factor      AME      SE       z      p    lower   upper
      educ   0.7223  0.1960  3.6841 0.0002   0.3380  1.1065
 occ_type2  15.7774 12.4911  1.2631 0.2066  -8.7047 40.2595
 occ_type3 -17.0105 12.3626 -1.3760 0.1688 -41.2407  7.2197
plot(marg1)

## use the following code to save the plot as a pdf file
## pdf('marg1_plot.pdf')
## plot(marg1)
## dev.off()

## alternative functions, to pdf(), include bmp(), jpeg(),
## png(), and tiff()

Logistic Regression

Logistic Regression

Row

Stata


. use https://stats.idre.ucla.edu/stat/stata/dae/binary.dta, cl. save gradSchool.dta, replace
file gradSchool.dta saved

. 
. su admit

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       admit |        400       .3175    .4660867          0          1

. 
. logit admit gre gpa i.rank

Iteration 0:   log likelihood = -249.98826  
Iteration 1:   log likelihood = -229.66446  
Iteration 2:   log likelihood = -229.25955  
Iteration 3:   log likelihood = -229.25875  
Iteration 4:   log likelihood = -229.25875  

Logistic regression                             Number of obs     =        400
                                                LR chi2(5)        =      41.46
                                                Prob > chi2       =     0.0000
Log likelihood = -229.25875                     Pseudo R2         =     0.0829

------------------------------------------------------------------------------
       admit |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gre |   .0022644    .001094     2.07   0.038     .0001202    .0044086
         gpa |   .8040377   .3318193     2.42   0.015     .1536838    1.454392
             |
        rank |
          2  |  -.6754429   .3164897    -2.13   0.033    -1.295751   -.0551346
          3  |  -1.340204   .3453064    -3.88   0.000    -2.016992   -.6634158
          4  |  -1.551464   .4178316    -3.71   0.000    -2.370399   -.7325287
             |
       _cons |  -3.989979   1.139951    -3.50   0.000    -6.224242   -1.755717
------------------------------------------------------------------------------

. 
. 

R

gradSchool <- read_dta('gradSchool.dta')
logit1 <- glm(admit ~ gre + gpa + factor(rank), data = gradSchool,
              family = binomial(link = "logit"))
summary(logit1)

Call:
glm(formula = admit ~ gre + gpa + factor(rank), family = binomial(link = "logit"), 
    data = gradSchool)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6268  -0.8662  -0.6388   1.1490   2.0790  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -3.989979   1.139951  -3.500 0.000465 ***
gre            0.002264   0.001094   2.070 0.038465 *  
gpa            0.804038   0.331819   2.423 0.015388 *  
factor(rank)2 -0.675443   0.316490  -2.134 0.032829 *  
factor(rank)3 -1.340204   0.345306  -3.881 0.000104 ***
factor(rank)4 -1.551464   0.417832  -3.713 0.000205 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 458.52  on 394  degrees of freedom
AIC: 470.52

Number of Fisher Scoring iterations: 4

Probit Regression

Probit Regression

Row

Stata


. use gradSchool.dta, repl. probit admit gre gpa i.rank

Iteration 0:   log likelihood = -249.98826  
Iteration 1:   log likelihood = -229.29667  
Iteration 2:   log likelihood = -229.20659  
Iteration 3:   log likelihood = -229.20658  

Probit regression                               Number of obs     =        400
                                                LR chi2(5)        =      41.56
                                                Prob > chi2       =     0.0000
Log likelihood = -229.20658                     Pseudo R2         =     0.0831

------------------------------------------------------------------------------
       admit |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gre |   .0013756   .0006489     2.12   0.034     .0001038    .0026473
         gpa |   .4777302   .1954625     2.44   0.015     .0946308    .8608297
             |
        rank |
          2  |  -.4153992   .1953769    -2.13   0.033    -.7983308   -.0324675
          3  |   -.812138   .2085956    -3.89   0.000    -1.220978   -.4032981
          4  |   -.935899   .2456339    -3.81   0.000    -1.417333   -.4544654
             |
       _cons |  -2.386838   .6740879    -3.54   0.000    -3.708026   -1.065649
------------------------------------------------------------------------------

. 

R

gradSchool <- read_dta('gradSchool.dta')
probit1 <- glm(admit ~ gre + gpa + factor(rank), data = gradSchool,
              family = binomial(link = "probit"))
summary(probit1)

Call:
glm(formula = admit ~ gre + gpa + factor(rank), family = binomial(link = "probit"), 
    data = gradSchool)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6163  -0.8710  -0.6389   1.1560   2.1035  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -2.386836   0.673946  -3.542 0.000398 ***
gre            0.001376   0.000650   2.116 0.034329 *  
gpa            0.477730   0.197197   2.423 0.015410 *  
factor(rank)2 -0.415399   0.194977  -2.131 0.033130 *  
factor(rank)3 -0.812138   0.208358  -3.898 9.71e-05 ***
factor(rank)4 -0.935899   0.245272  -3.816 0.000136 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 458.41  on 394  degrees of freedom
AIC: 470.41

Number of Fisher Scoring iterations: 4
# change reference category
gradSchool$f_rank <- factor(gradSchool$rank)
gradSchool$f_rank <- relevel(gradSchool$f_rank, ref = "3")
probit2 <- glm(admit ~ gre + gpa + f_rank, data = gradSchool,
              family = binomial(link = "probit"))
summary(probit2)

Call:
glm(formula = admit ~ gre + gpa + f_rank, family = binomial(link = "probit"), 
    data = gradSchool)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6163  -0.8710  -0.6389   1.1560   2.1035  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.198975   0.670818  -4.769 1.85e-06 ***
gre          0.001376   0.000650   2.116   0.0343 *  
gpa          0.477730   0.197197   2.423   0.0154 *  
f_rank1      0.812138   0.208358   3.898 9.71e-05 ***
f_rank2      0.396739   0.168054   2.361   0.0182 *  
f_rank4     -0.123761   0.223955  -0.553   0.5805    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 458.41  on 394  degrees of freedom
AIC: 470.41

Number of Fisher Scoring iterations: 4

Multinomial Regression

Multinomial Regression

Row

Stata


. use https://stats.idre.ucla.edu/stat/data/hsbdemo, cl(highschool and beyond (200 cases))

. save hsbdemo, replace
file hsbdemo.dta saved

. mlogit prog i.ses write, base(2)

Iteration 0:   log likelihood = -204.09667  
Iteration 1:   log likelihood = -180.80105  
Iteration 2:   log likelihood = -179.98724  
Iteration 3:   log likelihood = -179.98173  
Iteration 4:   log likelihood = -179.98173  

Multinomial logistic regression                 Number of obs     =        200
                                                LR chi2(6)        =      48.23
                                                Prob > chi2       =     0.0000
Log likelihood = -179.98173                     Pseudo R2         =     0.1182

------------------------------------------------------------------------------
        prog |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
general      |
         ses |
     middle  |   -.533291   .4437321    -1.20   0.229     -1.40299     .336408
       high  |  -1.162832   .5142195    -2.26   0.024    -2.170684   -.1549804
             |
       write |  -.0579284   .0214109    -2.71   0.007    -.0998931   -.0159637
       _cons |   2.852186   1.166439     2.45   0.014     .5660075    5.138365
-------------+----------------------------------------------------------------
academic     |  (base outcome)
-------------+----------------------------------------------------------------
vocation     |
         ses |
     middle  |   .2913931   .4763737     0.61   0.541    -.6422822    1.225068
       high  |  -.9826703   .5955669    -1.65   0.099     -2.14996    .1846195
             |
       write |  -.1136026   .0222199    -5.11   0.000    -.1571528   -.0700524
       _cons |     5.2182   1.163549     4.48   0.000     2.937686    7.498714
------------------------------------------------------------------------------

. 

R

hsbdemo <- read_dta('hsbdemo.dta')
summary(hsbdemo)
       id             female           ses            schtyp    
 Min.   :  1.00   Min.   :0.000   Min.   :1.000   Min.   :1.00  
 1st Qu.: 50.75   1st Qu.:0.000   1st Qu.:2.000   1st Qu.:1.00  
 Median :100.50   Median :1.000   Median :2.000   Median :1.00  
 Mean   :100.50   Mean   :0.545   Mean   :2.055   Mean   :1.16  
 3rd Qu.:150.25   3rd Qu.:1.000   3rd Qu.:3.000   3rd Qu.:1.00  
 Max.   :200.00   Max.   :1.000   Max.   :3.000   Max.   :2.00  
      prog            read           write            math      
 Min.   :1.000   Min.   :28.00   Min.   :31.00   Min.   :33.00  
 1st Qu.:2.000   1st Qu.:44.00   1st Qu.:45.75   1st Qu.:45.00  
 Median :2.000   Median :50.00   Median :54.00   Median :52.00  
 Mean   :2.025   Mean   :52.23   Mean   :52.77   Mean   :52.65  
 3rd Qu.:2.250   3rd Qu.:60.00   3rd Qu.:60.00   3rd Qu.:59.00  
 Max.   :3.000   Max.   :76.00   Max.   :67.00   Max.   :75.00  
    science          socst           honors          awards          cid       
 Min.   :26.00   Min.   :26.00   Min.   :0.000   Min.   :0.00   Min.   : 1.00  
 1st Qu.:44.00   1st Qu.:46.00   1st Qu.:0.000   1st Qu.:0.00   1st Qu.: 5.00  
 Median :53.00   Median :52.00   Median :0.000   Median :1.00   Median :10.50  
 Mean   :51.85   Mean   :52.41   Mean   :0.265   Mean   :1.67   Mean   :10.43  
 3rd Qu.:58.00   3rd Qu.:61.00   3rd Qu.:1.000   3rd Qu.:2.00   3rd Qu.:15.00  
 Max.   :74.00   Max.   :71.00   Max.   :1.000   Max.   :7.00   Max.   :20.00  
hsbdemo$f_prog <- factor( hsbdemo$prog)
hsbdemo$f_prog <- relevel(hsbdemo$f_prog, ref = "2")
mnlog1 <- multinom(f_prog ~ factor(ses) + write, data = hsbdemo)
# weights:  15 (8 variable)
initial  value 219.722458 
iter  10 value 179.982880
final  value 179.981726 
converged
summary(mnlog1)
Call:
multinom(formula = f_prog ~ factor(ses) + write, data = hsbdemo)

Coefficients:
  (Intercept) factor(ses)2 factor(ses)3      write
1    2.852198   -0.5332810   -1.1628226 -0.0579287
3    5.218260    0.2913859   -0.9826649 -0.1136037

Std. Errors:
  (Intercept) factor(ses)2 factor(ses)3      write
1    1.166441    0.4437323    0.5142196 0.02141097
3    1.163552    0.4763739    0.5955665 0.02221996

Residual Deviance: 359.9635 
AIC: 375.9635 

Poisson Regression

Poisson Regression

Row

Stata


. use https://stats.idre.ucla.edu/stat/stata/dae/nb_data, cl. save schoolAbsence, replace
file schoolAbsence.dta saved

. poisson daysabs math i.prog

Iteration 0:   log likelihood = -1328.6751  
Iteration 1:   log likelihood = -1328.6425  
Iteration 2:   log likelihood = -1328.6425  

Poisson regression                              Number of obs     =        314
                                                LR chi2(3)        =     443.73
                                                Prob > chi2       =     0.0000
Log likelihood = -1328.6425                     Pseudo R2         =     0.1431

------------------------------------------------------------------------------
     daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |  -.0068084   .0009311    -7.31   0.000    -.0086332   -.0049835
             |
        prog |
          2  |  -.4398975    .056672    -7.76   0.000    -.5509725   -.3288224
          3  |  -1.281364   .0778898   -16.45   0.000    -1.434025   -1.128703
             |
       _cons |   2.651974   .0607367    43.66   0.000     2.532932    2.771015
------------------------------------------------------------------------------

. 
. 

R

schoolAbsence <- read_dta('schoolAbsence.dta')
pois1 <- glm(daysabs ~ math + factor(prog), data = schoolAbsence,
             family=poisson(link = "log"))
summary(pois1)

Call:
glm(formula = daysabs ~ math + factor(prog), family = poisson(link = "log"), 
    data = schoolAbsence)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.2597  -2.2038  -0.9193   0.6511   7.4233  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    2.651974   0.060736  43.664  < 2e-16 ***
math          -0.006808   0.000931  -7.313 2.62e-13 ***
factor(prog)2 -0.439897   0.056672  -7.762 8.35e-15 ***
factor(prog)3 -1.281364   0.077886 -16.452  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2217.7  on 313  degrees of freedom
Residual deviance: 1774.0  on 310  degrees of freedom
AIC: 2665.3

Number of Fisher Scoring iterations: 5

Negative Binomial Regression

Negative Binomial Regression

Row

Stata


. use schoolAbsence, repl. nbreg daysabs math i.prog

Fitting Poisson model:

Iteration 0:   log likelihood = -1328.6751  
Iteration 1:   log likelihood = -1328.6425  
Iteration 2:   log likelihood = -1328.6425  

Fitting constant-only model:

Iteration 0:   log likelihood = -899.27009  
Iteration 1:   log likelihood = -896.47264  
Iteration 2:   log likelihood = -896.47237  
Iteration 3:   log likelihood = -896.47237  

Fitting full model:

Iteration 0:   log likelihood = -870.49809  
Iteration 1:   log likelihood = -865.90381  
Iteration 2:   log likelihood = -865.62942  
Iteration 3:   log likelihood =  -865.6289  
Iteration 4:   log likelihood =  -865.6289  

Negative binomial regression                    Number of obs     =        314
                                                LR chi2(3)        =      61.69
Dispersion     = mean                           Prob > chi2       =     0.0000
Log likelihood =  -865.6289                     Pseudo R2         =     0.0344

------------------------------------------------------------------------------
     daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |   -.005993   .0025072    -2.39   0.017     -.010907    -.001079
             |
        prog |
          2  |    -.44076    .182576    -2.41   0.016    -.7986025   -.0829175
          3  |  -1.278651   .2019811    -6.33   0.000    -1.674526    -.882775
             |
       _cons |   2.615265   .1963519    13.32   0.000     2.230423    3.000108
-------------+----------------------------------------------------------------
    /lnalpha |  -.0321895   .1027882                     -.2336506    .1692717
-------------+----------------------------------------------------------------
       alpha |   .9683231   .0995322                      .7916384    1.184442
------------------------------------------------------------------------------
LR test of alpha=0: chibar2(01) = 926.03               Prob >= chibar2 = 0.000

. 

R

schoolAbsence <- read_dta('schoolAbsence.dta')
nbreg1 <- glm.nb(daysabs ~ math + factor(prog), data = schoolAbsence)
summary(nbreg1)

Call:
glm.nb(formula = daysabs ~ math + factor(prog), data = schoolAbsence, 
    init.theta = 1.032713156, link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1547  -1.0192  -0.3694   0.2285   2.5273  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    2.615265   0.197460  13.245  < 2e-16 ***
math          -0.005993   0.002505  -2.392   0.0167 *  
factor(prog)2 -0.440760   0.182610  -2.414   0.0158 *  
factor(prog)3 -1.278651   0.200720  -6.370 1.89e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.0327) family taken to be 1)

    Null deviance: 427.54  on 313  degrees of freedom
Residual deviance: 358.52  on 310  degrees of freedom
AIC: 1741.3

Number of Fisher Scoring iterations: 1


              Theta:  1.033 
          Std. Err.:  0.106 

 2 x log-likelihood:  -1731.258 
# Stata parameterization is 1 / theta