1. The Linear Regression Model: An Overview#
suppressPackageStartupMessages({
library(tidyverse)
library(tidymodels)
library(readxl)
library(multcomp)
library(car)
})
Wage Function#
Wage: Hourly wage in dollars, which is the dependent variable.
Female: Gender, coded 1 for female, 0 for male
Nonwhite: Race, coded 1 for nonwhite workers, 0 for white workers
Union: Union status, coded 1 if in a union job, 0 otherwise
Education: Education (in years)
Exper: Work experience (in years), defined as age - education - 6
df <- read_excel("data/Table1_1.xls")
head(df)
| obs | wage | female | nonwhite | union | education | exper | age | wind | femalenonw | lnwage | education_exper | _Ifemale_1 | _IfemXeduca_1 | _IfemXexper_1 | _Inonwhite_1 | _InonXeduca_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 1 | 11.55 | 1 | 0 | 0 | 12 | 20 | 38 | 1 | 0 | 2.446686 | 240 | 1 | 12 | 20 | 0 | 0 |
| 2 | 5.00 | 0 | 0 | 0 | 9 | 9 | 24 | 0 | 0 | 1.609438 | 81 | 0 | 0 | 0 | 0 | 0 |
| 3 | 12.00 | 0 | 0 | 0 | 16 | 15 | 37 | 1 | 0 | 2.484907 | 240 | 0 | 0 | 0 | 0 | 0 |
| 4 | 7.00 | 0 | 1 | 1 | 14 | 38 | 58 | 0 | 0 | 1.945910 | 532 | 0 | 0 | 0 | 1 | 14 |
| 5 | 21.15 | 1 | 1 | 0 | 16 | 19 | 41 | 1 | 1 | 3.051640 | 304 | 1 | 16 | 19 | 1 | 16 |
| 6 | 6.92 | 1 | 0 | 0 | 12 | 4 | 22 | 1 | 0 | 1.934416 | 48 | 1 | 12 | 4 | 0 | 0 |
df |> glimpse()
Rows: 1,289
Columns: 17
$ obs <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ wage <dbl> 11.55, 5.00, 12.00, 7.00, 21.15, 6.92, 10.00, 8.00, 15…
$ female <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, …
$ nonwhite <dbl> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, …
$ union <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, …
$ education <dbl> 12, 9, 16, 14, 16, 12, 12, 12, 18, 18, 20, 12, 5, 12, …
$ exper <dbl> 20, 9, 15, 38, 19, 4, 14, 32, 7, 5, 31, 7, 31, 14, 15,…
$ age <dbl> 38, 24, 37, 58, 41, 22, 32, 50, 31, 29, 57, 25, 42, 32…
$ wind <dbl> 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
$ femalenonw <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ lnwage <dbl> 2.446686, 1.609438, 2.484907, 1.945910, 3.051640, 1.93…
$ education_exper <dbl> 240, 81, 240, 532, 304, 48, 168, 384, 126, 90, 620, 84…
$ `_Ifemale_1` <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, …
$ `_IfemXeduca_1` <dbl> 12, 0, 0, 0, 16, 12, 12, 12, 0, 18, 0, 12, 0, 0, 0, 14…
$ `_IfemXexper_1` <dbl> 20, 0, 0, 0, 19, 4, 14, 32, 0, 5, 0, 7, 0, 0, 0, 26, 0…
$ `_Inonwhite_1` <dbl> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, …
$ `_InonXeduca_1` <dbl> 0, 0, 0, 14, 16, 0, 0, 12, 0, 0, 0, 0, 0, 12, 0, 14, 1…
df |> summary()
obs wage female nonwhite
Min. : 1 Min. : 0.84 Min. :0.0000 Min. :0.0000
1st Qu.: 323 1st Qu.: 6.92 1st Qu.:0.0000 1st Qu.:0.0000
Median : 645 Median :10.08 Median :0.0000 Median :0.0000
Mean : 645 Mean :12.37 Mean :0.4973 Mean :0.1528
3rd Qu.: 967 3rd Qu.:15.63 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :1289 Max. :64.08 Max. :1.0000 Max. :1.0000
union education exper age
Min. :0.000 Min. : 0.00 Min. : 0.00 Min. :18.00
1st Qu.:0.000 1st Qu.:12.00 1st Qu.: 9.00 1st Qu.:29.00
Median :0.000 Median :12.00 Median :18.00 Median :37.00
Mean :0.159 Mean :13.15 Mean :18.79 Mean :37.93
3rd Qu.:0.000 3rd Qu.:16.00 3rd Qu.:27.00 3rd Qu.:47.00
Max. :1.000 Max. :20.00 Max. :56.00 Max. :65.00
wind femalenonw lnwage education_exper
Min. :0.0000 Min. :0.00000 Min. :-0.1744 Min. : 0.0
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.: 1.9344 1st Qu.:120.0
Median :0.0000 Median :0.00000 Median : 2.3106 Median :228.0
Mean :0.4073 Mean :0.08379 Mean : 2.3424 Mean :241.1
3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.: 2.7492 3rd Qu.:352.0
Max. :1.0000 Max. :1.00000 Max. : 4.1601 Max. :740.0
_Ifemale_1 _IfemXeduca_1 _IfemXexper_1 _Inonwhite_1
Min. :0.0000 Min. : 0.000 Min. : 0.000 Min. :0.0000
1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.:0.0000
Median :0.0000 Median : 0.000 Median : 0.000 Median :0.0000
Mean :0.4973 Mean : 6.493 Mean : 9.212 Mean :0.1528
3rd Qu.:1.0000 3rd Qu.:12.000 3rd Qu.:17.000 3rd Qu.:0.0000
Max. :1.0000 Max. :20.000 Max. :56.000 Max. :1.0000
_InonXeduca_1
Min. : 0.000
1st Qu.: 0.000
Median : 0.000
Mean : 1.921
3rd Qu.: 0.000
Max. :20.000
Exercises#
1.1.#
Consider the regression results given in Table 1.2.
model <- lm(wage~female+nonwhite+union+education+exper, data = df)
model |>
summary()
Call:
lm(formula = wage ~ female + nonwhite + union + education + exper,
data = df)
Residuals:
Min 1Q Median 3Q Max
-20.781 -3.760 -1.044 2.418 50.414
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.18334 1.01579 -7.072 2.51e-12 ***
female -3.07488 0.36462 -8.433 < 2e-16 ***
nonwhite -1.56531 0.50919 -3.074 0.00216 **
union 1.09598 0.50608 2.166 0.03052 *
education 1.37030 0.06590 20.792 < 2e-16 ***
exper 0.16661 0.01605 10.382 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.508 on 1283 degrees of freedom
Multiple R-squared: 0.3233, Adjusted R-squared: 0.3207
F-statistic: 122.6 on 5 and 1283 DF, p-value: < 2.2e-16
tidy(model)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| (Intercept) | -7.1833382 | 1.01578786 | -7.071691 | 2.508276e-12 |
| female | -3.0748754 | 0.36461621 | -8.433184 | 8.939423e-17 |
| nonwhite | -1.5653133 | 0.50918754 | -3.074139 | 2.155664e-03 |
| union | 1.0959758 | 0.50607809 | 2.165626 | 3.052356e-02 |
| education | 1.3703010 | 0.06590421 | 20.792312 | 5.507613e-83 |
| exper | 0.1666065 | 0.01604756 | 10.382050 | 2.659960e-24 |
The intercept is about -7.1833.
Literally interpreted, this would mean the average hourly wage is -$7.1833
when all the regressors in the regression are set equal to zero, which has no sensible economic meaning.
The female coefficient of -3.07 means that, holding all other variables constant, the
average female hourly wage is lower than the average male hourly wage by about 3.07
dollars.
glance(model)
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <int> | <int> |
| 0.3233388 | 0.3207018 | 6.508137 | 122.6149 | 3.453151e-106 | 5 | -4240.37 | 8494.741 | 8530.872 | 54342.54 | 1283 | 1289 |
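As a check on the glance() output, the adjusted R-squared and the overall F statistic can be recovered from r.squared, the sample size \(n = 1289\) and the \(k = 5\) regressors. This is a sketch using the rounded values printed above, so the last digits may differ slightly.

\begin{align}
\bar{R}^2 &= 1 - (1 - R^2)\frac{n - 1}{n - k - 1} = 1 - (1 - 0.3233388)\frac{1288}{1283} \approx 0.3207 \\
F &= \frac{R^2 / k}{(1 - R^2)/(n - k - 1)} = \frac{0.3233388 / 5}{0.6766612 / 1283} \approx 122.6
\end{align}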
augment(model) |>
head()
| wage | female | nonwhite | union | education | exper | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 11.55 | 1 | 0 | 0 | 12 | 20 | 9.517529 | 2.03247160 | 0.001942952 | 6.510426 | 3.170558e-05 | 0.31260084 |
| 5.00 | 0 | 0 | 0 | 9 | 9 | 6.648829 | -1.64882930 | 0.004661005 | 6.510511 | 5.032970e-05 | -0.25394141 |
| 12.00 | 0 | 0 | 0 | 16 | 15 | 17.240575 | -5.24057529 | 0.002569388 | 6.509025 | 2.790986e-04 | -0.80627086 |
| 7.00 | 0 | 1 | 1 | 14 | 38 | 17.862586 | -10.86258593 | 0.011178819 | 6.503522 | 5.308384e-03 | -1.67848587 |
| 21.15 | 1 | 1 | 0 | 16 | 19 | 13.266813 | 7.88318701 | 0.007329089 | 6.506923 | 1.818773e-03 | 1.21574509 |
| 6.92 | 1 | 0 | 0 | 12 | 4 | 6.851824 | 0.06817595 | 0.003318529 | 6.510674 | 6.109851e-08 | 0.01049292 |
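The .fitted and .resid columns can be reproduced by hand. As an illustration, substituting the first observation (female = 1, nonwhite = 0, union = 0, education = 12, exper = 20) into the estimated equation, with coefficients rounded to four decimals:

\begin{align}
\widehat{wage} &= -7.1833 - 3.0749(1) - 1.5653(0) + 1.0960(0) + 1.3703(12) + 0.1666(20) \approx 9.517 \\
e &= wage - \widehat{wage} = 11.55 - 9.517 \approx 2.03
\end{align}

These agree with the first row of the augment() output up to rounding.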
a.#
Suppose you want to test the hypothesis that the true or population regression coefficient of
the education variable is 1.
How would you test this hypothesis? Show the necessary
calculations.
\begin{align} H_0&: \beta_4 = 1 \\ H_1&: \beta_4 \neq 1 \end{align}
\begin{align} t &= \frac{b - b_0}{se(b)} \\ t &= \frac{1.3703010 - 1}{0.06590421} = 5.619 \end{align}
hypothesis_test <- linearHypothesis(model, "education = 1")
hypothesis_test
| | Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|---|
| | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 1 | 1284 | 55679.75 | NA | NA | NA | NA |
| 2 | 1283 | 54342.54 | 1 | 1337.201 | 31.57064 | 2.356207e-08 |
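The F statistic in this table can be recomputed from the restricted and unrestricted residual sums of squares. With a single restriction (\(q = 1\)) it equals the square of the t statistic for the same hypothesis. The values below are the rounded ones from the output, so this is only an approximate check.

\begin{align}
F &= \frac{(RSS_R - RSS_U)/q}{RSS_U/(n - k - 1)} = \frac{(55679.75 - 54342.54)/1}{54342.54/1283} \approx 31.57 \\
t^2 &= 5.619^2 \approx 31.57
\end{align}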
test <- glht(model, linfct = c("education = 1"))
summary(test)
Simultaneous Tests for General Linear Hypotheses
Fit: lm(formula = wage ~ female + nonwhite + union + education + exper,
data = df)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
education == 1 1.3703 0.0659 5.619 2.36e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
# manual calculation
coef <- summary(model)$coefficients
beta_education <- coef["education", "Estimate"]
se_education <- coef["education", "Std. Error"]
t_value <- (beta_education - 1) / se_education
degreesfreedom <- df.residual(model)
p_value <- 2 * pt(-abs(t_value), degreesfreedom)
t_value
p_value
b.#
Would you reject or not reject the hypothesis that the true union regression coefficient is 1?
test_union <- glht(model, linfct = c("union = 1"))
summary(test_union)
Simultaneous Tests for General Linear Hypotheses
Fit: lm(formula = wage ~ female + nonwhite + union + education + exper,
data = df)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
union == 1 1.0960 0.5061 0.19 0.85
(Adjusted p values reported -- single-step method)
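Fail to reject. The t statistic is 0.19 with a p-value of 0.85, so the estimated union coefficient of about 1.10 is not statistically different from 1 at any conventional significance level.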
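The same conclusion can be reached by hand. The sketch below uses only base R and copies the estimate, standard error and residual degrees of freedom from the regression output instead of refitting the model. The variable names are illustrative and do not come from the original notebook.

```r
# Estimate and standard error copied from the regression output (not recomputed here)
b_union  <- 1.0959758
se_union <- 0.50607809
df_resid <- 1283  # residual degrees of freedom

# t statistic for H0: beta_union = 1
t_union <- (b_union - 1) / se_union

# Two-sided 5% critical value and p-value from the t distribution
t_crit  <- qt(0.975, df_resid)
p_union <- 2 * pt(-abs(t_union), df_resid)

# 95% confidence interval for the union coefficient
ci_union <- b_union + c(-1, 1) * t_crit * se_union

# Reject H0 only if |t| exceeds the critical value
reject_h0 <- abs(t_union) > t_crit

print(c(t = t_union, critical = t_crit, p = p_union))
print(ci_union)
print(reject_h0)
```

Because the hypothesised value 1 lies inside the 95% confidence interval, the interval and the t test lead to the same decision.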
c.#
Can you take the logs of the nominal variables, such as gender, race and union status? Why or why not?
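No. These variables are categorical dummies coded 0 and 1, so their values are labels rather than magnitudes. The log of 0 is undefined and the log of 1 is 0, so the transformed variable would be unusable.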
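A minimal illustration in base R, using a made-up 0/1 vector in place of the actual dummy columns:

```r
# Hypothetical dummy variable coded 0/1, standing in for female, nonwhite or union
dummy <- c(0, 1, 1, 0)

# log(0) is -Inf in R and log(1) is 0, so no finite variation remains
log_dummy <- log(dummy)
print(log_dummy)
```

A regressor containing -Inf cannot be used in lm(), which is a practical reason, on top of the conceptual one, not to log dummy variables.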
d.#
What other variables are missing from the model?
df |> colnames()
- 'obs'
- 'wage'
- 'female'
- 'nonwhite'
- 'union'
- 'education'
- 'exper'
- 'age'
- 'wind'
- 'femalenonw'
- 'lnwage'
- 'education_exper'
- '_Ifemale_1'
- '_IfemXeduca_1'
- '_IfemXexper_1'
- '_Inonwhite_1'
- '_InonXeduca_1'
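Columns in the data set that are not used in the model: age, wind, femalenonw (female × nonwhite) and education_exper (education × exper). lnwage is the log of the dependent variable, so it is an alternative response rather than a missing regressor.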
e.#
Would you run separate wage regressions for white and nonwhite workers,
male and female workers, and union and non-union workers? And how
would you compare them?
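Yes. One approach is to estimate a separate regression for each group and use an F test (a Chow-type test) of the hypothesis that the coefficients are equal across groups. Equivalently, interact the group dummy with the other regressors in a pooled model and test the interaction terms jointly.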
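A sketch of such a comparison on simulated data (not the wage data set), assuming a single continuous regressor and one group dummy. The coefficients used to generate the data are arbitrary. The restricted model forces both groups to share one intercept and slope, the full model lets both differ, and anova() reports the F test of the two extra terms.

```r
set.seed(42)

# Simulated data: all names and parameter values are illustrative assumptions
n         <- 200
female    <- rbinom(n, 1, 0.5)
education <- sample(8:20, n, replace = TRUE)
wage      <- 1 + 1.3 * education - 3 * female + rnorm(n, sd = 4)

# Restricted model: same intercept and slope for both groups
pooled <- lm(wage ~ education)

# Unrestricted model: intercept and slope may differ by group
# (equivalent to fitting separate regressions for each group)
full <- lm(wage ~ education * female)

# F test of H0: the female and education:female coefficients are both zero
chow <- anova(pooled, full)
print(chow)
```

On the actual data the same pattern would apply with the full set of regressors interacted with the group dummy.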
f.#
Some states have right-to-work laws (i.e., union membership is not mandatory) and some
do not have such laws (i.e., mandatory union membership is permitted). Is it worth adding a dummy
variable taking the value of 1 if the right-to-work laws are present and 0 otherwise? A priori,
what would you expect if this variable is added to the model?
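Yes, if right-to-work status plausibly affects wages, it is worth adding. A priori, one might expect a negative coefficient, since unions tend to have less bargaining power where membership cannot be required, but the sign is ultimately an empirical question.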
h.#
Would you add the age of the worker as an explanatory variable to the model? Why or why not?
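No. Experience is defined as \(exper = age - education - 6\), so age is an exact linear combination of education and exper. Adding it to the model would cause perfect collinearity.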
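The effect of perfect collinearity can be seen in a small simulation. The data below are made up, with age constructed from the same identity. When a regressor is an exact linear combination of the others, lm() cannot estimate its coefficient and reports NA.

```r
set.seed(1)

# Simulated data: names mirror the wage data, but values are illustrative only
n         <- 50
education <- sample(8:20, n, replace = TRUE)
exper     <- sample(0:40, n, replace = TRUE)
age       <- education + exper + 6  # exact linear identity
wage      <- 2 + 1.2 * education + 0.15 * exper + rnorm(n)

# age is perfectly collinear with education, exper and the intercept
fit <- lm(wage ~ education + exper + age)
print(coef(fit))
```

The NA for age signals that the model matrix is rank deficient, so one of the three variables has to be left out.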