3. Qualitative Explanatory Variables Regression Models#

suppressPackageStartupMessages({
  library(tidyverse)
  library(tidymodels)
  library(readxl)
  library(multcomp)
  library(car)
})

Wage function#

\[ Wage_i = \beta_1 + \beta_2D_{2i} + \beta_3D_{3i} + \beta_4D_{4i} + \beta_5Educ_i + \beta_6Exper_i + u_i \]

\(D_2\) = 1 if female, 0 for male
\(D_3\) = 1 for nonwhite, 0 for white
\(D_4\) = 1 for union member, 0 for non-union
All \(D\) are dummy variables

df <- read_excel("data/Table1_1.xls")
head(df,3)
A tibble: 3 × 17
  obs  wage female nonwhite union education exper   age  wind femalenonw   lnwage
    1 11.55      1        0     0        12    20    38     1          0 2.446686
    2  5.00      0        0     0         9     9    24     0          0 1.609438
    3 12.00      0        0     0        16    15    37     1          0 2.484907
# … with 6 more variables: education_exper, _Ifemale_1, _IfemXeduca_1,
#   _IfemXexper_1, _Inonwhite_1, _InonXeduca_1
model <- lm(wage ~ female + nonwhite + union + education + exper, data = df)
tidy(model)
glance(model)
A tibble: 6 × 5
  term          estimate  std.error statistic      p.value
  (Intercept) -7.1833382 1.01578786 -7.071691 2.508276e-12
  female      -3.0748754 0.36461621 -8.433184 8.939423e-17
  nonwhite    -1.5653133 0.50918754 -3.074139 2.155664e-03
  union        1.0959758 0.50607809  2.165626 3.052356e-02
  education    1.3703010 0.06590421 20.792312 5.507613e-83
  exper        0.1666065 0.01604756 10.382050 2.659960e-24

A tibble: 1 × 12
  r.squared adj.r.squared    sigma statistic       p.value    df   logLik      AIC      BIC deviance df.residual  nobs
  0.3233388     0.3207018 6.508137  122.6149 3.453151e-106     5 -4240.37 8494.741 8530.872 54342.54        1283  1289

female dummy variable: the average wage of a female is lower by $3.07 than that of a male (the reference category), ceteris paribus

union dummy variable: the average wage of a union member is higher by $1.10 than that of non-union workers, ceteris paribus

nonwhite dummy variable: the average wage of a nonwhite worker is lower by $1.56 than that of white workers, ceteris paribus

Intercept: –7.18 is the expected hourly wage of a white, non-union male worker with zero years of education and experience

These dummy coefficients are called differential intercept dummies: each measures a shift of the intercept relative to the reference category.
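To make the differential-intercept reading concrete, the fitted coefficients can be combined by hand (a sketch; the 12 years of education and 10 years of experience are arbitrary illustrative values):

```r
# Predicted average wage for two otherwise-identical workers (nonwhite = 0,
# union = 0, education = 12, exper = 10) who differ only in the female dummy.
b <- coef(model)
wage_male   <- b["(Intercept)"] + b["education"] * 12 + b["exper"] * 10
wage_female <- wage_male + b["female"]
unname(wage_male - wage_female)  # the gap equals -b["female"], about 3.07
```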

Refinement of the wage function#

Interactive dummy: female non white worker

model2 <- lm(wage ~ female + nonwhite + union + education + exper + female*nonwhite, data = df)
tidy(model2)
glance(model2)
A tibble: 7 × 5
  term             estimate  std.error statistic      p.value
  (Intercept)     -7.088725 1.01948176 -6.953264 5.671212e-12
  female          -3.240148 0.39532778 -8.196106 5.961063e-16
  nonwhite        -2.158525 0.74842579 -2.884087 3.991105e-03
  union            1.115044 0.50635176  2.202113 2.783489e-02
  education        1.370113 0.06590009 20.790761 5.752227e-83
  exper            0.165856 0.01606150 10.326307 4.560482e-24
  female:nonwhite  1.095371 1.01289673  1.081424 2.797118e-01

A tibble: 1 × 12
  r.squared adj.r.squared    sigma statistic       p.value    df    logLik      AIC      BIC deviance df.residual  nobs
  0.3239555     0.3207915 6.507707  102.3875 2.252543e-105     6 -4239.783 8495.565 8536.858 54293.02        1282  1289

Interactive dummy: being female lowers the average wage by $3.24; being nonwhite lowers it by $2.16.
Being both female and nonwhite lowers the average wage by (−3.24 − 2.16 + 1.10) ≈ −4.30, since the interaction term (not statistically significant here, p ≈ 0.28) partially offsets the two individual effects.
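The combined female-nonwhite differential is the linear combination \(\beta_{female} + \beta_{nonwhite} + \beta_{female:nonwhite}\); since multcomp is already loaded, `glht` can estimate and test it directly (a sketch using a contrast matrix):

```r
# Test H0: female + nonwhite + female:nonwhite = 0
K <- matrix(0, nrow = 1, ncol = length(coef(model2)),
            dimnames = list("female nonwhite total", names(coef(model2))))
K[1, c("female", "nonwhite", "female:nonwhite")] <- 1
summary(glht(model2, linfct = K))  # estimate about -4.30
```

A contrast matrix avoids the parsing problems that coefficient names containing `:` can cause in the character-string interface.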

Another refinement of the wage function#

\begin{align} \text{wage}_i &= \beta_1 + \beta_2 \text{female} + \beta_3 \text{nonwhite} + \beta_4 \text{union} + \beta_5 \text{educ} \\ &+ \beta_6 \text{exper} + \beta_7(\text{female} \times \text{educ}) + \beta_8(\text{nonwhite} \times \text{educ}) \\ &+ \beta_9(\text{union} \times \text{educ}) + \beta_{10}(\text{female} \times \text{exper}) + \beta_{11}(\text{nonwhite} \times \text{exper}) \\ &+ \beta_{12}(\text{union} \times \text{exper}) + u_i \end{align}

\(\beta_2, \beta_3, \beta_4\) are differential intercept dummies
\(\beta_7 \to \beta_{12}\) are differential slope dummies
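In an R formula, `a*b` expands to `a + b + a:b`, so the twelve-term specification below can equivalently be written with the interaction operator crossing each dummy with both quantitative regressors in one step (a sketch; `model3b` is an illustrative name):

```r
# (female + nonwhite + union) * (education + exper) expands to all five main
# effects plus the six dummy-by-regressor interactions -- the same 12 terms.
model3b <- lm(wage ~ (female + nonwhite + union) * (education + exper), data = df)
```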

model3 <- lm(wage ~ female + nonwhite + union + education + exper +
             female*education + female*exper +
             nonwhite*education + nonwhite*exper +
             union*education + union*exper, data = df)
tidy(model3)
glance(model3)
A tibble: 12 × 5
  term                    estimate  std.error  statistic      p.value
  (Intercept)         -11.09128575 1.42184620 -7.8006227 1.272876e-14
  female                3.17415751 1.96646494  1.6141440 1.067433e-01
  nonwhite              2.90912852 2.78006592  1.0464243 2.955632e-01
  union                 4.45421150 2.97349392  1.4979723 1.343876e-01
  education             1.58712500 0.09381936 16.9168183 4.408735e-58
  exper                 0.22091235 0.02510676  8.7989186 4.394876e-18
  female:education     -0.33688760 0.13199299 -2.5523143 1.081657e-02
  female:exper         -0.09612482 0.03181329 -3.0215303 2.565047e-03
  nonwhite:education   -0.32185517 0.19534842 -1.6475954 9.968173e-02
  nonwhite:exper       -0.02204137 0.04437560 -0.4967001 6.194861e-01
  union:education      -0.19832332 0.19137307 -1.0363178 3.002501e-01
  union:exper          -0.03345389 0.04605375 -0.7264096 4.677208e-01

A tibble: 1 × 12
  r.squared adj.r.squared    sigma statistic       p.value    df    logLik      AIC     BIC deviance df.residual  nobs
  0.3328109     0.3270637 6.477589  57.90909 3.562619e-104    11 -4231.285 8488.569 8555.67 53581.84        1277  1289

female:exper interaction: for each additional year of experience, the wage increase for females is about $0.096 less than for males, ceteris paribus.
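The gender-specific returns to experience implied by model3 can be read off the coefficients directly (a sketch):

```r
b3 <- coef(model3)
slope_male   <- unname(b3["exper"])                       # about 0.221 per year
slope_female <- unname(b3["exper"] + b3["female:exper"])  # about 0.221 - 0.096 = 0.125
c(male = slope_male, female = slope_female)
```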

Functional form of wage regression#

summary(df$wage)

ggplot(df, aes(x = wage)) +
  geom_histogram(bins = 30)

shapiro.test(df$wage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.84    6.92   10.08   12.37   15.63   64.08 
	Shapiro-Wilk normality test

data:  df$wage
W = 0.84488, p-value < 2.2e-16
[Figure: histogram of wage]
library(moments)
skewness(df$wage)
kurtosis(df$wage)
1.848114127393
7.83656558034119
summary(df$lnwage)
skewness(df$lnwage)
kurtosis(df$lnwage)

ggplot(df, aes(x = lnwage)) +
  geom_histogram(bins = 30)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.1744  1.9344  2.3106  2.3424  2.7492  4.1601 
0.0133949929381925
3.22633731089373
[Figure: histogram of lnwage]
library(tseries)
jarque.bera.test(df$lnwage)
shapiro.test(df$lnwage)
	Jarque Bera Test

data:  df$lnwage
X-squared = 2.7899, df = 2, p-value = 0.2478
	Shapiro-Wilk normality test

data:  df$lnwage
W = 0.99175, p-value = 1.268e-06
model4 <- lm(lnwage ~ female + nonwhite + union + education + exper, data = df)
tidy(model4)
glance(model4)
A tibble: 6 × 5
  term           estimate   std.error statistic      p.value
  (Intercept)  0.90550368 0.074174901 12.207683 1.662727e-32
  female      -0.24915404 0.026625019 -9.357891 3.511043e-20
  nonwhite    -0.13353510 0.037181913 -3.591399 3.413335e-04
  union        0.18020354 0.036954854  4.876316 1.216066e-06
  education    0.09987030 0.004812460 20.752442 1.025031e-82
  exper        0.01276009 0.001171825 10.889073 1.795174e-26

A tibble: 1 × 12
  r.squared adj.r.squared     sigma statistic       p.value    df    logLik     AIC      BIC deviance df.residual  nobs
  0.3456505     0.3431004 0.4752374  135.5452 1.738922e-115     5 -867.0651 1748.13 1784.261 289.7663        1283  1289

The education coefficient of 0.0999 suggests that for every additional year of schooling the average wage rate goes up by about 9.99%, ceteris paribus.

The female dummy coefficient of −0.2492 suggests that the average female wage rate is lower by about 24.92% than the male average wage rate. (approximately)

(exp(-0.2492) - 1) * 100
-22.0575927019416

The average female wage rate is lower by about 22% than the male average wage rate. (exact)
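The same exact conversion applies to any coefficient in this log-lin model; for education (a sketch):

```r
# Exact percentage effect of one more year of schooling in the semilog model
(exp(0.0998703) - 1) * 100   # about 10.5%, versus the approximate 9.99%
```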

Dummy variables in structural change#

\[ GPI_t = \beta_1 + \beta_2 GPS_t + u_t \]

GPS: gross private savings
GPI: gross private investments

df2 <- read_excel("data/Table3_6.xls")
head(df2,3)
A tibble: 3 × 5
   obs  gpi  gps recession81 gpsrec81
  1959 78.5 84.6           0        0
  1960 78.9 84.8           0        0
  1961 78.2 91.8           0        0
model5 <- lm(gpi ~ gps, data = df2)
tidy(model5)
glance(model5)
A tibble: 2 × 5
  term           estimate   std.error statistic      p.value
  (Intercept) -78.721047 27.48474252 -2.864173 6.230637e-03
  gps           1.107395  0.02907993 38.081092 5.567391e-37

A tibble: 1 × 12
  r.squared adj.r.squared    sigma statistic      p.value    df    logLik      AIC      BIC deviance df.residual  nobs
  0.9686074     0.9679395 114.8681   1450.17 5.567391e-37     1 -300.9524 607.9049 613.5803 620149.8          47    49

if GPS increases by a dollar, the average GPI goes up by about $1.10.

Check for structural break#

\[ GPI_t = \beta_1 + \beta_2 GPS_t + \beta_3 \text{Recession81}_t + u_t \]

GPS: gross private savings
GPI: gross private investments
Recession81: 0 before 1981, 1 afterwards

model6 <- lm(gpi ~ gps + recession81, data = df2)
tidy(model6)
glance(model6)
A tibble: 3 × 5
  term            estimate   std.error statistic      p.value
  (Intercept)   -83.486025 23.15912656 -3.604887 7.649795e-04
  gps             1.288672  0.04706553 27.380380 4.045379e-30
  recession81  -240.787848 53.39662813 -4.509420 4.464419e-05

A tibble: 1 × 12
  r.squared adj.r.squared    sigma statistic      p.value    df    logLik      AIC      BIC deviance df.residual  nobs
  0.9782308     0.9772843 96.68906  1033.538 5.893647e-39     2 -291.9836 591.9672 599.5345 430043.7          46    49

Intercept pre-recession: −83.49
Intercept post-recession: (−83.49 − 240.79) = −324.27
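These intercepts come straight from the fitted coefficients (a sketch):

```r
b6 <- coef(model6)
unname(b6["(Intercept)"])                      # pre-1981 intercept, about -83.49
unname(b6["(Intercept)"] + b6["recession81"])  # post-1981 intercept, about -324.27
```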

To check for slope:

\[ GPI_t = \beta_1 + \beta_2 GPS_t + \beta_3 \text{Recession81}_t + \beta_4 (\text{GPS} \times \text{Recession81})_t + u_t \]
model7 <- lm(gpi ~ gps + recession81 + gps*recession81, data = df2)
tidy(model7)
glance(model7)
A tibble: 4 × 5
  term                estimate  std.error  statistic      p.value
  (Intercept)       -7.7798689 38.4495931 -0.2023394 8.405634e-01
  gps                0.9510818  0.1474505  6.4501785 6.686678e-08
  recession81     -357.4586528 70.2863017 -5.0857513 6.910536e-06
  gps:recession81    0.3719201  0.1547662  2.4031102 2.044029e-02

A tibble: 1 × 12
  r.squared adj.r.squared    sigma statistic     p.value    df    logLik     AIC      BIC deviance df.residual  nobs
  0.9807067     0.9794205 92.03045  762.4731 1.42319e-38     3 -289.0255 588.051 597.5101 381132.2          45    49

GPI pre-recession: −7.78 + 0.95 GPS
GPI post-recession: (−7.78 − 357.46) + (0.95 + 0.37) GPS = −365.24 + 1.32 GPS
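Because model5 is nested in model7, the structural break can be tested jointly (the Chow-test logic, implemented with dummies) by an F-test comparing the two fits:

```r
# H0: no structural break, i.e. recession81 and gps:recession81 are both zero
anova(model5, model7)
```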

Dummy variables in seasonal data (deseasonalization)#

df3 <- read_excel("data/Table3_10.xls")
head(df3,3)
A tibble: 3 × 10
  obs     sales   rpdi  conf    d2    d3    d4  lnsales
  1986:1 53.714 84.182  86.6     0     0     0 3.983674
  1986:2 71.501 85.042  90.3     1     0     0 4.269711
  1986:3 96.374 84.572  84.5     0     1     0 4.568236
# … with 2 more variables: yearq, trend
\[ Sales_t = A_1 + A_2D_{2t} + A_3D_{3t} + A_4D_{4t} + u_t \]

\(D_2\) = 1 for second quarter
\(D_3\) = 1 for third quarter
\(D_4\) = 1 for fourth quarter

model8 <- lm(sales ~ d2 + d3 + d4, data = df3)
tidy(model8)
glance(model8)
A tibble: 4 × 5
  term        estimate std.error statistic      p.value
  (Intercept) 73.18343  3.977483 18.399433 1.178376e-15
  d2          14.69229  5.625010  2.611957 1.528526e-02
  d3          27.96472  5.625010  4.971496 4.468310e-05
  d4          57.11472  5.625010 10.153709 3.646049e-10

A tibble: 1 × 12
  r.squared adj.r.squared    sigma statistic      p.value    df    logLik      AIC      BIC deviance df.residual  nobs
  0.8234883     0.8014244 10.52343  37.32278 3.372103e-09     3 -103.4731 216.9462 223.6072 2657.822          24    28

Interpretation of D2: the mean sales value in the second quarter is greater than the mean sales in the first (reference) quarter by 14.69229 units; the actual mean sales value in the second quarter is (73.18343 + 14.69229) = 87.87572.

Deseasonalize#

  1. obtain estimated sales volume

  2. obtain residuals (actual sales - estimated sales)

  3. add mean value of sales to residuals

augmented <- augment(model8)
augmented <- augmented %>%
  mutate(deseasonalized_sales = .resid + mean(sales))
head(augmented)
A tibble: 6 × 11
    sales    d2    d3    d4   .fitted     .resid      .hat    .sigma     .cooksd .std.resid deseasonalized_sales
   53.714     0     0     0  73.18343 -19.469427 0.1428571  9.814778 0.166390008 -1.9983394             78.65693
   71.501     1     0     0  87.87571 -16.374714 0.1428571 10.097357 0.117697809 -1.6806985             81.75164
   96.374     0     1     0 101.14814  -4.774144 0.1428571 10.695856 0.010004880 -0.4900175             93.35221
  125.041     0     0     1 130.29814  -5.257143 0.1428571 10.684361 0.012131669 -0.5395925             92.86921
   78.610     0     0     0  73.18343   5.426573 0.1428571 10.680063 0.012926240  0.5569827            103.55293
   89.609     1     0     0  87.87571   1.733287 0.1428571 10.742676 0.001318749  0.1779044             99.85964
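Because OLS residuals sum to zero whenever the model has an intercept, deseasonalizing this way preserves the overall mean of the series (a quick check):

```r
# The deseasonalized series fluctuates around the same overall mean as sales
all.equal(mean(augmented$deseasonalized_sales), mean(augmented$sales))
```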

Expanded sales function#

Frisch–Waugh theorem

model9 <- lm(sales ~ rpdi + conf + d2 + d3 + d4, data = df3)
tidy(model9)
glance(model9)
A tibble: 6 × 5
  term             estimate   std.error statistic      p.value
  (Intercept) -152.9292193 52.59149307 -2.907870 8.156667e-03
  rpdi           1.5989033  0.37015528  4.319547 2.764247e-04
  conf           0.2939096  0.08437568  3.483344 2.106581e-03
  d2            15.0452205  4.31537708  3.486421 2.091102e-03
  d3            26.0024741  4.32524385  6.011794 4.741138e-06
  d4            60.8722628  4.42743714 13.748871 2.795431e-12

A tibble: 1 × 12
  r.squared adj.r.squared    sigma statistic     p.value    df   logLik      AIC      BIC deviance df.residual  nobs
  0.9053748      0.883869 8.047637  42.09922 1.53222e-10     5 -94.7446 203.4892 212.8146 1424.818          22    28
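This regression illustrates the Frisch–Waugh theorem: partialling the seasonal dummies out of sales and out of each quantitative regressor, then regressing residuals on residuals, reproduces the rpdi and conf slopes of model9 (a sketch):

```r
# Partial the seasonal dummies out of the dependent variable and each regressor
r_sales <- resid(lm(sales ~ d2 + d3 + d4, data = df3))
r_rpdi  <- resid(lm(rpdi  ~ d2 + d3 + d4, data = df3))
r_conf  <- resid(lm(conf  ~ d2 + d3 + d4, data = df3))
coef(lm(r_sales ~ r_rpdi + r_conf))[-1]  # matches coef(model9)[c("rpdi", "conf")]
```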

Piecewise Linear Regression#

Introduce a threshold (knot) at which the slope is allowed to change.

df4 <- read_excel("data/Table3_16.xls")
head(df4)
A tibble: 6 × 6
  obs avgcost lotsize     z     d    zd
    1    9.73     100  -100     0     0
    2    9.61     120   -80     0     0
    3    8.15     140   -60     0     0
    4    6.98     160   -40     0     0
    5    5.87     180   -20     0     0
    6    4.98     200     0     0     0
ggplot(df4, aes(x = lotsize, y = avgcost)) +
  geom_point()
[Figure: scatterplot of avgcost vs lotsize]
\[ Y_i = \beta_0 + \beta_1 X_i + \beta_2(X_i-X^*)D_i + u_i \]

Y = average production cost,
X = lot size,
X* = threshold (knot) value of lot size,
D = 1 if Xi > X*
D = 0 if Xi ≤ X*
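With the knot at X* = 200, the regressor zd used below equals (lotsize − 200) for observations past the knot and 0 before it; it can be reconstructed with dplyr (a sketch; `d_check` and `zd_check` are illustrative names):

```r
# Rebuild the piecewise regressor from lotsize and the knot at 200
df4 <- df4 %>%
  mutate(d_check  = as.numeric(lotsize > 200),
         zd_check = (lotsize - 200) * d_check)
all.equal(df4$zd, df4$zd_check)  # should be TRUE for this dataset
```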

model10 <- lm(avgcost ~ lotsize, data = df4)
tidy(model10)
glance(model10)
A tibble: 2 × 5
  term           estimate   std.error statistic      p.value
  (Intercept) 12.29090855 0.749146219 16.406555 5.167361e-08
  lotsize     -0.03077273 0.003571414 -8.616397 1.217549e-05

A tibble: 1 × 12
  r.squared adj.r.squared     sigma statistic      p.value    df   logLik      AIC      BIC deviance df.residual  nobs
  0.8918819     0.8798688 0.7491462   74.2423 1.217549e-05     1 -11.3276 28.65521 29.84889 5.050981           9    11
model11 <- lm(avgcost ~ lotsize + zd, data = df4)
tidy(model11)
glance(model11)
A tibble: 3 × 5
  term           estimate   std.error  statistic      p.value
  (Intercept) 15.11647994 0.535382569  28.234912 2.674940e-09
  lotsize     -0.05019853 0.003332127 -15.065011 3.726359e-07
  zd           0.03885161 0.005945966   6.534112 1.814740e-04

A tibble: 1 × 12
  r.squared adj.r.squared     sigma statistic      p.value    df    logLik      AIC      BIC  deviance df.residual  nobs
  0.9829381     0.9786727 0.3156508  230.4409 8.474347e-08     2 -1.172523 10.34505 11.93663 0.7970835           8    11

lot size ≤ 200:
Average cost = 15.1164 − 0.0502 × lot size

lot size > 200:
Average cost = (15.1164 − 0.0389 × 200) + (−0.0502 + 0.0389) × lot size = 7.35 − 0.0113 × lot size

Up to a lot size of 200 units, unit cost decreases by about 5 cents for each unit increase in lot size; beyond 200 units it decreases by only about 1 cent.

Exercises#

🚧 Under Construction