1. The Linear Regression Model: An Overview

suppressPackageStartupMessages({
  library(tidyverse)
  library(tidymodels)
  library(readxl)
  library(multcomp)
  library(car)
})

Wage Function

Wage: Hourly wage in dollars, which is the dependent variable.
Female: Gender, coded 1 for female, 0 for male
Nonwhite: Race, coded 1 for nonwhite workers, 0 for white workers
Union: Union status, coded 1 if in a union job, 0 otherwise
Education: Education (in years)

df <- read_excel("data/Table1_1.xls")
head(df)
A tibble: 6 × 17 (all columns <dbl>)
  obs   wage  female  nonwhite  union  education  exper  age  wind  femalenonw    lnwage  education_exper  _Ifemale_1  _IfemXeduca_1  _IfemXexper_1  _Inonwhite_1  _InonXeduca_1
    1  11.55       1         0      0         12     20   38     1           0  2.446686              240           1             12             20             0              0
    2   5.00       0         0      0          9      9   24     0           0  1.609438               81           0              0              0             0              0
    3  12.00       0         0      0         16     15   37     1           0  2.484907              240           0              0              0             0              0
    4   7.00       0         1      1         14     38   58     0           0  1.945910              532           0              0              0             1             14
    5  21.15       1         1      0         16     19   41     1           1  3.051640              304           1             16             19             1             16
    6   6.92       1         0      0         12      4   22     1           0  1.934416               48           1             12              4             0              0
df |> glimpse()
Rows: 1,289
Columns: 17
$ obs             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ wage            <dbl> 11.55, 5.00, 12.00, 7.00, 21.15, 6.92, 10.00, 8.00, 15…
$ female          <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, …
$ nonwhite        <dbl> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, …
$ union           <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, …
$ education       <dbl> 12, 9, 16, 14, 16, 12, 12, 12, 18, 18, 20, 12, 5, 12, …
$ exper           <dbl> 20, 9, 15, 38, 19, 4, 14, 32, 7, 5, 31, 7, 31, 14, 15,…
$ age             <dbl> 38, 24, 37, 58, 41, 22, 32, 50, 31, 29, 57, 25, 42, 32…
$ wind            <dbl> 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
$ femalenonw      <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ lnwage          <dbl> 2.446686, 1.609438, 2.484907, 1.945910, 3.051640, 1.93…
$ education_exper <dbl> 240, 81, 240, 532, 304, 48, 168, 384, 126, 90, 620, 84…
$ `_Ifemale_1`    <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, …
$ `_IfemXeduca_1` <dbl> 12, 0, 0, 0, 16, 12, 12, 12, 0, 18, 0, 12, 0, 0, 0, 14…
$ `_IfemXexper_1` <dbl> 20, 0, 0, 0, 19, 4, 14, 32, 0, 5, 0, 7, 0, 0, 0, 26, 0…
$ `_Inonwhite_1`  <dbl> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, …
$ `_InonXeduca_1` <dbl> 0, 0, 0, 14, 16, 0, 0, 12, 0, 0, 0, 0, 0, 12, 0, 14, 1…
df |> summary()
      obs            wage           female          nonwhite     
 Min.   :   1   Min.   : 0.84   Min.   :0.0000   Min.   :0.0000  
 1st Qu.: 323   1st Qu.: 6.92   1st Qu.:0.0000   1st Qu.:0.0000  
 Median : 645   Median :10.08   Median :0.0000   Median :0.0000  
 Mean   : 645   Mean   :12.37   Mean   :0.4973   Mean   :0.1528  
 3rd Qu.: 967   3rd Qu.:15.63   3rd Qu.:1.0000   3rd Qu.:0.0000  
 Max.   :1289   Max.   :64.08   Max.   :1.0000   Max.   :1.0000  
     union         education         exper            age       
 Min.   :0.000   Min.   : 0.00   Min.   : 0.00   Min.   :18.00  
 1st Qu.:0.000   1st Qu.:12.00   1st Qu.: 9.00   1st Qu.:29.00  
 Median :0.000   Median :12.00   Median :18.00   Median :37.00  
 Mean   :0.159   Mean   :13.15   Mean   :18.79   Mean   :37.93  
 3rd Qu.:0.000   3rd Qu.:16.00   3rd Qu.:27.00   3rd Qu.:47.00  
 Max.   :1.000   Max.   :20.00   Max.   :56.00   Max.   :65.00  
      wind          femalenonw          lnwage        education_exper
 Min.   :0.0000   Min.   :0.00000   Min.   :-0.1744   Min.   :  0.0  
 1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.: 1.9344   1st Qu.:120.0  
 Median :0.0000   Median :0.00000   Median : 2.3106   Median :228.0  
 Mean   :0.4073   Mean   :0.08379   Mean   : 2.3424   Mean   :241.1  
 3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.: 2.7492   3rd Qu.:352.0  
 Max.   :1.0000   Max.   :1.00000   Max.   : 4.1601   Max.   :740.0  
   _Ifemale_1     _IfemXeduca_1    _IfemXexper_1     _Inonwhite_1   
 Min.   :0.0000   Min.   : 0.000   Min.   : 0.000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.:0.0000  
 Median :0.0000   Median : 0.000   Median : 0.000   Median :0.0000  
 Mean   :0.4973   Mean   : 6.493   Mean   : 9.212   Mean   :0.1528  
 3rd Qu.:1.0000   3rd Qu.:12.000   3rd Qu.:17.000   3rd Qu.:0.0000  
 Max.   :1.0000   Max.   :20.000   Max.   :56.000   Max.   :1.0000  
 _InonXeduca_1   
 Min.   : 0.000  
 1st Qu.: 0.000  
 Median : 0.000  
 Mean   : 1.921  
 3rd Qu.: 0.000  
 Max.   :20.000  

Exercises

1.1.

Consider the regression results given in Table 1.2.

model <- lm(wage~female+nonwhite+union+education+exper, data = df)
model |> 
summary()
Call:
lm(formula = wage ~ female + nonwhite + union + education + exper, 
    data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-20.781  -3.760  -1.044   2.418  50.414 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -7.18334    1.01579  -7.072 2.51e-12 ***
female      -3.07488    0.36462  -8.433  < 2e-16 ***
nonwhite    -1.56531    0.50919  -3.074  0.00216 ** 
union        1.09598    0.50608   2.166  0.03052 *  
education    1.37030    0.06590  20.792  < 2e-16 ***
exper        0.16661    0.01605  10.382  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.508 on 1283 degrees of freedom
Multiple R-squared:  0.3233,	Adjusted R-squared:  0.3207 
F-statistic: 122.6 on 5 and 1283 DF,  p-value: < 2.2e-16
tidy(model)
A tibble: 6 × 5
term           estimate   std.error  statistic       p.value
<chr>             <dbl>       <dbl>      <dbl>         <dbl>
(Intercept)  -7.1833382  1.01578786  -7.071691  2.508276e-12
female       -3.0748754  0.36461621  -8.433184  8.939423e-17
nonwhite     -1.5653133  0.50918754  -3.074139  2.155664e-03
union         1.0959758  0.50607809   2.165626  3.052356e-02
education     1.3703010  0.06590421  20.792312  5.507613e-83
exper         0.1666065  0.01604756  10.382050  2.659960e-24

The intercept is about –7.1833.
Interpreted literally, this means the average hourly wage is –$7.1833
when all the regressors in the regression are set equal to zero, which has no sensible economic meaning.

The female coefficient of –3.07 means that, holding all other variables constant, the
average female hourly wage is lower than the average male hourly wage by about 3
dollars.
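To make these interpretations concrete, the sketch below plugs the estimates reported by tidy(model) into the fitted equation. The helper `predict_wage` and the worker profile (12 years of education, 20 years of experience) are illustrative assumptions, not part of the original exercise.

```r
# Coefficient estimates copied from the tidy(model) output
b <- c(
  intercept = -7.1833382,
  female    = -3.0748754,
  nonwhite  = -1.5653133,
  union     =  1.0959758,
  education =  1.3703010,
  exper     =  0.1666065
)

# Hypothetical helper: fitted hourly wage for a given worker profile
predict_wage <- function(female, nonwhite, union, education, exper) {
  unname(
    b["intercept"] + b["female"] * female + b["nonwhite"] * nonwhite +
      b["union"] * union + b["education"] * education + b["exper"] * exper
  )
}

# With every regressor at zero, the fitted wage is just the intercept
wage_zero <- predict_wage(0, 0, 0, 0, 0)

# Two workers who differ only in gender
wage_male   <- predict_wage(0, 0, 0, 12, 20)
wage_female <- predict_wage(1, 0, 0, 12, 20)

# The fitted gap equals the female coefficient
gap <- wage_female - wage_male
```

The female profile used here has the same regressor values as the first observation in the data, so `wage_female` should agree with that row's fitted value up to rounding of the copied coefficients.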

glance(model)
A tibble: 1 × 12
r.squared  adj.r.squared     sigma  statistic        p.value     df    logLik       AIC       BIC  deviance  df.residual   nobs
    <dbl>          <dbl>     <dbl>      <dbl>          <dbl>  <dbl>     <dbl>     <dbl>     <dbl>     <dbl>        <int>  <int>
0.3233388      0.3207018  6.508137   122.6149  3.453151e-106      5  -4240.37  8494.741  8530.872  54342.54         1283   1289
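As a quick consistency check, the overall F statistic and the adjusted R-squared can be recovered from r.squared, nobs and df alone. This is a minimal sketch using the rounded values printed by glance(model); the variable names are chosen here for illustration.

```r
# Values copied from the glance(model) output
r2 <- 0.3233388  # r.squared
n  <- 1289       # nobs
k  <- 5          # df: number of slope coefficients

# Overall F statistic for H0: all slope coefficients are zero
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

# Adjusted R-squared penalizes for the number of regressors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

Both results should agree with the statistic and adj.r.squared columns up to rounding.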
augment(model) |> 
head()
A tibble: 6 × 12
 wage  female  nonwhite  union  education  exper    .fitted        .resid         .hat    .sigma       .cooksd   .std.resid
<dbl>   <dbl>     <dbl>  <dbl>      <dbl>  <dbl>      <dbl>         <dbl>        <dbl>     <dbl>         <dbl>        <dbl>
11.55       1         0      0         12     20   9.517529    2.03247160  0.001942952  6.510426  3.170558e-05   0.31260084
 5.00       0         0      0          9      9   6.648829   -1.64882930  0.004661005  6.510511  5.032970e-05  -0.25394141
12.00       0         0      0         16     15  17.240575   -5.24057529  0.002569388  6.509025  2.790986e-04  -0.80627086
 7.00       0         1      1         14     38  17.862586  -10.86258593  0.011178819  6.503522  5.308384e-03  -1.67848587
21.15       1         1      0         16     19  13.266813    7.88318701  0.007329089  6.506923  1.818773e-03   1.21574509
 6.92       1         0      0         12      4   6.851824    0.06817595  0.003318529  6.510674  6.109851e-08   0.01049292

a.

Suppose you want to test the hypothesis that the true or population regression coefficient of the education variable is 1.
How would you test this hypothesis? Show the necessary calculations.

\[ wage = \beta_0 + \beta_1~ \text{female} + \beta_2~ \text{nonwhite} + \beta_3~ \text{union} + \beta_4~ \text{education} + \beta_5~ \text{exper} \]

\begin{align} H_0&: \beta_4 = 1 \\ H_1&: \beta_4 \neq 1 \end{align}

\begin{align} t &= \frac{b - b_0}{se(b)} \\ t &= \frac{1.3703010 - 1}{0.06590421} = 5.619 \end{align}

hypothesis_test <- linearHypothesis(model, "education = 1")
hypothesis_test
A anova: 2 × 6
   Res.Df       RSS     Df  Sum of Sq         F        Pr(>F)
    <dbl>     <dbl>  <dbl>      <dbl>     <dbl>         <dbl>
1    1284  55679.75     NA         NA        NA            NA
2    1283  54342.54      1   1337.201  31.57064  2.356207e-08
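The F statistic reported by linearHypothesis() can be rebuilt from the two residual sums of squares, and with a single restriction it equals the square of the t statistic. The sketch below uses the rounded values printed in the outputs, so agreement is only up to rounding; the variable names are illustrative.

```r
# Values copied from the linearHypothesis() output
rss_restricted   <- 55679.75  # RSS with the education coefficient fixed at 1
rss_unrestricted <- 54342.54  # RSS of the fitted model
q           <- 1              # number of restrictions
df_residual <- 1283           # residual df of the unrestricted model

# F = [(RSS_r - RSS_u) / q] / [RSS_u / df]
f_stat <- ((rss_restricted - rss_unrestricted) / q) /
  (rss_unrestricted / df_residual)

# t statistic for the same hypothesis, from the estimate and its standard error
t_value <- (1.3703010 - 1) / 0.06590421

# Upper-tail p-value of the F test
p_value <- pf(f_stat, q, df_residual, lower.tail = FALSE)
```

Comparing `t_value^2` with `f_stat` shows why the F test and the two-sided t test lead to the same conclusion here.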
test <- glht(model, linfct = c("education = 1"))
summary(test)
	 Simultaneous Tests for General Linear Hypotheses

Fit: lm(formula = wage ~ female + nonwhite + union + education + exper, 
    data = df)

Linear Hypotheses:
               Estimate Std. Error t value Pr(>|t|)    
education == 1   1.3703     0.0659   5.619 2.36e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
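# Manual calculation of the t test for H0: education coefficient = 1

# Coefficient table with estimates and standard errors
# (named coefs to avoid masking the base function coef())
coefs <- summary(model)$coefficients
beta_education <- coefs["education", "Estimate"]
se_education <- coefs["education", "Std. Error"]

# t statistic for the hypothesized value 1
t_value <- (beta_education - 1) / se_education

# Residual degrees of freedom of the fitted model
degreesfreedom <- df.residual(model)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), degreesfreedom)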
t_value
p_value
5.61877531138879
2.35620718007122e-08

b.

Would you reject or not reject the hypothesis that the true union regression coefficient is 1?

test_union <- glht(model, linfct = c("union = 1"))
summary(test_union)
	 Simultaneous Tests for General Linear Hypotheses

Fit: lm(formula = wage ~ female + nonwhite + union + education + exper, 
    data = df)

Linear Hypotheses:
           Estimate Std. Error t value Pr(>|t|)
union == 1   1.0960     0.5061    0.19     0.85
(Adjusted p values reported -- single-step method)

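Fail to reject the null hypothesis: with t = 0.19 and p = 0.85, the union coefficient is not statistically different from 1.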
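The same conclusion can be reached by hand. This is a minimal sketch using the union estimate and standard error printed by tidy(model); the 5% significance level is an assumption made here for illustration.

```r
# Values copied from the tidy(model) output
b_union  <- 1.0959758
se_union <- 0.50607809
df_resid <- 1283  # residual degrees of freedom

# t statistic for H0: union coefficient = 1
t_union <- (b_union - 1) / se_union

# Two-sided p-value
p_union <- 2 * pt(-abs(t_union), df_resid)

# Two-sided critical value at the (assumed) 5% level
t_crit <- qt(0.975, df_resid)
reject <- abs(t_union) > t_crit
```

Because `abs(t_union)` is far below `t_crit`, `reject` is FALSE.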

c.

Can you take the logs of the nominal variables, such as gender, race and union status? Why or why not?

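No. These variables are category labels coded 0 or 1, not numerical measurements. The log of 0 is undefined and the log of 1 is 0, so the transformation has no meaningful interpretation.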
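A short illustration, using the first six values of the female column shown by glimpse():

```r
# First six values of the female dummy
female <- c(1, 0, 0, 0, 1, 1)

# log(1) is 0 and log(0) is -Inf, so the transformed variable is unusable
log_female <- log(female)
log_female
```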

d.

What other variables are missing from the model?

df |> colnames()
  1. 'obs'
  2. 'wage'
  3. 'female'
  4. 'nonwhite'
  5. 'union'
  6. 'education'
  7. 'exper'
  8. 'age'
  9. 'wind'
  10. 'femalenonw'
  11. 'lnwage'
  12. 'education_exper'
  13. '_Ifemale_1'
  14. '_IfemXeduca_1'
  15. '_IfemXexper_1'
  16. '_Inonwhite_1'
  17. '_InonXeduca_1'

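Columns in the data set that are not used in the model: age, wind, femalenonw and education_exper (the product of education and exper). lnwage is the log of the dependent variable rather than a candidate regressor.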

e.

Would you run separate wage regressions for white and nonwhite workers,
male and female workers, and union and non-union workers? And how would you compare them?

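Yes. The separate regressions can be compared with an F test (for example, a Chow-type test) of whether the coefficients are equal across the two groups.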
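One way to carry out such an F test is sketched below on simulated data, since the comparison is not worked out in this notebook. The data frame `sim`, its columns and the coefficient values are all hypothetical; with the real data, `group` would be female, nonwhite or union.

```r
set.seed(1)

# Simulated data (hypothetical): two groups with different intercepts
n <- 200
sim <- data.frame(
  group     = rep(c(0, 1), each = n / 2),
  education = runif(n, 8, 20)
)
sim$wage <- 2 + 1.2 * sim$education - 3 * sim$group + rnorm(n, sd = 2)

# Restricted model: one regression for everyone
pooled <- lm(wage ~ education, data = sim)

# Unrestricted model: a separate regression for each group
fit0 <- lm(wage ~ education, data = subset(sim, group == 0))
fit1 <- lm(wage ~ education, data = subset(sim, group == 1))

rss_r <- sum(resid(pooled)^2)
rss_u <- sum(resid(fit0)^2) + sum(resid(fit1)^2)
k     <- length(coef(pooled))  # parameters per regression
df_u  <- n - 2 * k             # residual df of the unrestricted model

# Chow-type F statistic and its p-value
f_stat  <- ((rss_r - rss_u) / k) / (rss_u / df_u)
p_value <- pf(f_stat, k, df_u, lower.tail = FALSE)

# Equivalent formulation: pooled model vs. fully interacted model
full <- lm(wage ~ education * group, data = sim)
anova(pooled, full)
```

The fully interacted model has the same residual sum of squares as the two separate regressions combined, so anova() reports the same F statistic.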

f.

Some states have right-to-work laws (i.e., union membership is not mandatory) and some
do not have such laws (i.e., mandatory union membership is permitted). Is it worth adding a dummy
variable taking the value of 1 if the right-to-work laws are present and 0 otherwise? A priori,
what would you expect if this variable is added to the model?

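Yes, it is worth adding if right-to-work status can affect the dependent variable, wage. A priori, one might expect a negative coefficient, on the reasoning that such laws weaken union bargaining power.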

h.

Would you add the age of the worker as an explanatory variable to the model? Why or why not?

No. Experience is constructed as \(exper = age - education - 6\), so adding age to a model that already contains education and exper would lead to perfect collinearity.
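The effect of this exact linear relationship can be seen in a small simulation. The data below are hypothetical and only mimic the construction of exper; when a regressor is an exact linear combination of the others, lm() cannot estimate its coefficient and reports NA.

```r
set.seed(42)

# Simulated data (hypothetical), with exper built from age and education
n <- 100
education <- sample(8:20, n, replace = TRUE)
age       <- sample(30:65, n, replace = TRUE)
exper     <- age - education - 6
wage      <- -7 + 1.4 * education + 0.17 * exper + rnorm(n, sd = 3)

# age is an exact linear combination of education, exper and the intercept
fit <- lm(wage ~ education + exper + age)

# The coefficient on age is reported as NA
coef(fit)
```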