import pandas as pd
import numpy as np
import wooldridge
import statsmodels.formula.api as smf


def summarize_model(model):
    """
    Returns a summary table with parameters, number of observations, and R-squared.

    Parameters:
        model: A fitted statsmodels regression result (e.g., OLS).

    Returns:
        pd.DataFrame: Summary table.
    """
    params = model.params
    summary_df = pd.DataFrame(
        {
            "Parameter": list(params.index) + ["Number of Observations", "R-squared"],
            "Estimate": list(params.values) + [int(model.nobs), model.rsquared],
        }
    )
    return summary_df
wooldridge.data()
J.M. Wooldridge (2019) Introductory Econometrics: A Modern Approach,
Cengage Learning, 6th edition.
401k 401ksubs admnrev affairs airfare
alcohol apple approval athlet1 athlet2
attend audit barium beauty benefits
beveridge big9salary bwght bwght2 campus
card catholic cement census2000 ceosal1
ceosal2 charity consump corn countymurders
cps78_85 cps91 crime1 crime2 crime3
crime4 discrim driving earns econmath
elem94_95 engin expendshares ezanders ezunem
fair fertil1 fertil2 fertil3 fish
fringe gpa1 gpa2 gpa3 happiness
hprice1 hprice2 hprice3 hseinv htv
infmrt injury intdef intqrt inven
jtrain jtrain2 jtrain3 kielmc lawsch85
loanapp lowbrth mathpnl meap00_01 meap01
meap93 meapsingle minwage mlb1 mroz
murder nbasal nyse okun openness
pension phillips pntsprd prison prminwge
rdchem rdtelec recid rental return
saving sleep75 slp75_81 smoke traffic1
traffic2 twoyear volat vote1 vote2
voucher wage1 wage2 wagepan wageprc
wine
Examples
3.1 Determinants of College GPA
wooldridge.data("gpa1", description=True)
name of dataset: gpa1
no of variables: 29
no of observations: 141
+----------+--------------------------------+
| variable | label |
+----------+--------------------------------+
| age | in years |
| soph | =1 if sophomore |
| junior | =1 if junior |
| senior | =1 if senior |
| senior5 | =1 if fifth year senior |
| male | =1 if male |
| campus | =1 if live on campus |
| business | =1 if business major |
| engineer | =1 if engineering major |
| colGPA | MSU GPA |
| hsGPA | high school GPA |
| ACT | 'achievement' score |
| job19 | =1 if job <= 19 hours |
| job20 | =1 if job >= 20 hours |
| drive | =1 if drive to campus |
| bike | =1 if bicycle to campus |
| walk | =1 if walk to campus |
| voluntr | =1 if do volunteer work |
| PC | =1 of pers computer at sch |
| greek | =1 if fraternity or sorority |
| car | =1 if own car |
| siblings | =1 if have siblings |
| bgfriend | =1 if boy- or girlfriend |
| clubs | =1 if belong to MSU club |
| skipped | avg lectures missed per week |
| alcohol | avg # days per week drink alc. |
| gradMI | =1 if Michigan high school |
| fathcoll | =1 if father college grad |
| mothcoll | =1 if mother college grad |
+----------+--------------------------------+
Christopher Lemmon, a former MSU undergraduate, collected these data
from a survey he took of MSU students in Fall 1994.
df = wooldridge.data("gpa1")
df.head()
model01 = smf.ols("colGPA ~ hsGPA + ACT", data=df).fit()
model01.params, model01.nobs
(Intercept    1.286328
hsGPA         0.453456
ACT           0.009426
dtype: float64,
141.0)
The intercept of 1.29 is the predicted college GPA when both hsGPA and ACT are zero. It is not meaningful, because no one in the sample has a zero high school GPA or a zero ACT score.
Holding ACT fixed, one more point of hsGPA is associated with a 0.45-point increase in colGPA.
Holding hsGPA fixed, a 10-point increase in ACT is associated with only a 0.094-point increase in colGPA, a very small effect.
3.2 Hourly Wage Equation
wooldridge.data("wage1", description=True)
name of dataset: wage1
no of variables: 24
no of observations: 526
+----------+---------------------------------+
| variable | label |
+----------+---------------------------------+
| wage | average hourly earnings |
| educ | years of education |
| exper | years potential experience |
| tenure | years with current employer |
| nonwhite | =1 if nonwhite |
| female | =1 if female |
| married | =1 if married |
| numdep | number of dependents |
| smsa | =1 if live in SMSA |
| northcen | =1 if live in north central U.S |
| south | =1 if live in southern region |
| west | =1 if live in western region |
| construc | =1 if work in construc. indus. |
| ndurman | =1 if in nondur. manuf. indus. |
| trcommpu | =1 if in trans, commun, pub ut |
| trade | =1 if in wholesale or retail |
| services | =1 if in services indus. |
| profserv | =1 if in prof. serv. indus. |
| profocc | =1 if in profess. occupation |
| clerocc | =1 if in clerical occupation |
| servocc | =1 if in service occupation |
| lwage | log(wage) |
| expersq | exper^2 |
| tenursq | tenure^2 |
+----------+---------------------------------+
These are data from the 1976 Current Population Survey, collected by
Henry Farber when he and I were colleagues at MIT in 1988.
df = wooldridge.data("wage1")
df.head()
model02 = smf.ols("np.log(wage) ~ educ + exper + tenure", data=df).fit()
model02.params, model02.nobs
(Intercept    0.284360
educ          0.092029
exper         0.004121
tenure        0.022067
dtype: float64,
526.0)
Holding exper and tenure fixed, one more year of education is predicted to increase wage by about 9.2%.
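The 9.2% figure uses the standard approximation that 100·β gives the percentage change in a log-level model. For a full one-unit change, the exact percentage change is 100·(exp(β) − 1), which is slightly larger. A quick check using the coefficient reported above:

```python
import numpy as np

# educ coefficient from the log-wage regression above
b_educ = 0.092029

approx_pct = 100 * b_educ               # approximation: ~9.2% per year of education
exact_pct = 100 * (np.exp(b_educ) - 1)  # exact implied change: ~9.6% per year
```

For small coefficients the two are close; the gap grows as the coefficient gets larger.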
3.3 Participation in 401(k) Pension Plans
wooldridge.data("401k", description=True)
name of dataset: 401k
no of variables: 8
no of observations: 1534
+----------+---------------------------------+
| variable | label |
+----------+---------------------------------+
| prate | participation rate, percent |
| mrate | 401k plan match rate |
| totpart | total 401k participants |
| totelg | total eligible for 401k plan |
| age | age of 401k plan |
| totemp | total number of firm employees |
| sole | = 1 if 401k is firm's sole plan |
| ltotemp | log of totemp |
+----------+---------------------------------+
L.E. Papke (1995), “Participation in and Contributions to 401(k)
Pension Plans: Evidence from Plan Data,” Journal of Human Resources 30,
311-325. Professor Papke kindly provided these data. She gathered them
from the Internal Revenue Service’s Form 5500 tapes.
df = wooldridge.data("401k")
df.head()
model03 = smf.ols("prate ~ mrate + age", data=df).fit()
model03.params, model03.nobs
(Intercept    80.119047
mrate          5.521289
age            0.243147
dtype: float64,
1534.0)
model032 = smf.ols("prate ~ mrate", data=df).fit()
model032.params, model032.nobs
(Intercept    83.075455
mrate          5.861079
dtype: float64,
1534.0)
df["age"].corr(df["mrate"])
np.float64(0.11878414864794844)
The simple regression estimate of the mrate effect differs from the multiple regression estimate, but not by much, because the correlation between age and mrate is small (about 0.12).
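The link between the two estimates is the omitted-variable identity: the simple-regression slope equals the multiple-regression slope plus the coefficient on the omitted variable times the slope from regressing the omitted variable on the included one. A small synthetic-data sketch (the data and coefficients here are made up for illustration, not from the 401k file) verifying that the identity holds exactly in any sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)            # x2 is correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Long regression: y on x1 and x2
X = np.column_stack([np.ones(n), x1, x2])
b_long = np.linalg.lstsq(X, y, rcond=None)[0]

# Short regression: y on x1 only (x2 omitted)
Xs = np.column_stack([np.ones(n), x1])
b_short = np.linalg.lstsq(Xs, y, rcond=None)[0]

# Auxiliary regression: omitted x2 on included x1
delta = np.linalg.lstsq(Xs, x2, rcond=None)[0]

# Omitted-variable identity: b_short = b_long + (coef on x2) * delta
lhs = b_short[1]
rhs = b_long[1] + b_long[2] * delta[1]
```

Here x2 enters positively and is positively correlated with x1, so the short-regression slope overstates the long-regression slope, just as the small positive correlation between age and mrate nudges the simple mrate estimate up.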
3.4 Determinants of College GPA
df = wooldridge.data("gpa1")
model04 = smf.ols("colGPA ~ hsGPA + ACT", data=df).fit()
summarize_model(model04)
The R-squared of about 0.176 means that hsGPA and ACT together explain around 17.6% of the sample variation in colGPA.
3.5 Explaining Arrest Records
wooldridge.data("crime1", description=True)
name of dataset: crime1
no of variables: 16
no of observations: 2725
+----------+---------------------------------+
| variable | label |
+----------+---------------------------------+
| narr86 | # times arrested, 1986 |
| nfarr86 | # felony arrests, 1986 |
| nparr86 | # property crme arr., 1986 |
| pcnv | proportion of prior convictions |
| avgsen | avg sentence length, mos. |
| tottime | time in prison since 18 (mos.) |
| ptime86 | mos. in prison during 1986 |
| qemp86 | # quarters employed, 1986 |
| inc86 | legal income, 1986, $100s |
| durat | recent unemp duration |
| black | =1 if black |
| hispan | =1 if Hispanic |
| born60 | =1 if born in 1960 |
| pcnvsq | pcnv^2 |
| pt86sq | ptime86^2 |
| inc86sq | inc86^2 |
+----------+---------------------------------+
J. Grogger (1991), “Certainty vs. Severity of Punishment,” Economic
Inquiry 29, 297-309. Professor Grogger kindly provided a subset of the
data he used in his article.
df = wooldridge.data("crime1")
df.head()
model05 = smf.ols("narr86 ~ pcnv + ptime86 + qemp86", data=df).fit()
summarize_model(model05)
model052 = smf.ols("narr86 ~ pcnv + ptime86 + qemp86 + avgsen", data=df).fit()
summarize_model(model052)
Adding avgsen increases the R-squared only from about 0.041 to 0.042, a small effect. And the sign of its coefficient is surprising: a longer average sentence is associated with more criminal activity.
3.6 Hourly Wage Equation
df = wooldridge.data("wage1")
df.head()
model06 = smf.ols("np.log(wage) ~ educ", data=df).fit()
summarize_model(model06)
Ability is an omitted variable that is positively correlated with education. Hence, the estimated return to education is biased upward on average.
We cannot say that 0.083 is greater than the true return in this particular sample; the estimate can be higher or lower than the true value.
But on average across random samples, the estimates will be too large.
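The "too large on average" claim can be illustrated with a small Monte Carlo simulation (all numbers below are made up for illustration). When ability raises both education and wages but is omitted from the regression, the average of the short-regression estimates exceeds the true coefficient:

```python
import numpy as np

rng = np.random.default_rng(42)
true_beta = 0.08          # hypothetical true return to education
n, reps = 500, 2000
est = np.empty(reps)
for r in range(reps):
    abil = rng.normal(size=n)
    educ = 12 + 2 * abil + rng.normal(size=n)     # ability raises education
    lwage = (1.0 + true_beta * educ + 0.1 * abil
             + rng.normal(scale=0.3, size=n))     # ability also raises log wage
    # Short regression omitting ability
    X = np.column_stack([np.ones(n), educ])
    est[r] = np.linalg.lstsq(X, lwage, rcond=None)[0][1]

mean_est = est.mean()     # averages near 0.12: true 0.08 plus bias 0.1 * Cov/Var = 0.04
```

Any single draw of `est` can land below `true_beta`, but the mean across repeated samples sits above it, which is exactly the distinction the text draws between one sample and the average over samples.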
3.7 Evaluating a Job Training Program
Computer Problems
C1
A problem of interest to health officials (and others) is to determine the effects of smoking during pregnancy on infant health. One measure of infant health is birth weight; a birth weight that is too low can put an infant at risk for contracting various illnesses. Since factors other than cigarette smoking that affect birth weight are likely to be correlated with smoking, we should take those factors into account. For example, higher income generally results in access to better prenatal care, as well as better nutrition for the mother. An equation that recognizes this is
bwght = β0 + β1*cigs + β2*faminc + u
(i) What is the most likely sign for the coefficient on faminc?
Positive, because more income is likely correlated with better nutrition.
(ii) Do you think cigs and faminc are likely to be correlated? Explain why the correlation might be positive or negative.
Yes, a negative correlation is likely. As income increases, health awareness and education tend to increase as well, resulting in fewer cigarettes smoked.
(iii) Now, estimate the equation with and without faminc, using the data in BWGHT. Report the results in equation form, including the sample size and R-squared. Discuss your results, focusing on whether adding faminc substantially changes the estimated effect of cigs on bwght.
wooldridge.data("bwght", description=True)
name of dataset: bwght
no of variables: 14
no of observations: 1388
+----------+--------------------------------+
| variable | label |
+----------+--------------------------------+
| faminc | 1988 family income, $1000s |
| cigtax | cig. tax in home state, 1988 |
| cigprice | cig. price in home state, 1988 |
| bwght | birth weight, ounces |
| fatheduc | father's yrs of educ |
| motheduc | mother's yrs of educ |
| parity | birth order of child |
| male | =1 if male child |
| white | =1 if white |
| cigs | cigs smked per day while preg |
| lbwght | log of bwght |
| bwghtlbs | birth weight, pounds |
| packs | packs smked per day while preg |
| lfaminc | log(faminc) |
+----------+--------------------------------+
J. Mullahy (1997), “Instrumental-Variable Estimation of Count Data
Models: Applications to Models of Cigarette Smoking Behavior,” Review
of Economics and Statistics 79, 586-593. Professor Mullahy kindly
provided the data. He obtained them from the 1988 National Health
Interview Survey.
df = wooldridge.data("bwght")
df.head()
model1 = smf.ols(formula="bwght ~ cigs", data=df).fit()
model1.params, model1.nobs, model1.rsquared
(Intercept    119.771900
cigs          -0.513772
dtype: float64,
1388.0,
np.float64(0.022729121106052963))
model2 = smf.ols(formula="bwght ~ faminc + cigs", data=df).fit()
model2.params, model2.nobs, model2.rsquared
(Intercept    116.974130
faminc         0.092765
cigs          -0.463408
dtype: float64,
1388.0,
np.float64(0.029804837327581657))
df["faminc"].corr(df["cigs"])
np.float64(-0.17304492573586483)
The correlation between faminc and cigs is weakly negative. When faminc is omitted, its positive effect on birth weight is absorbed into the cigs coefficient. Since faminc and cigs are negatively correlated, this leads to downward (negative) bias, making the effect of cigs appear slightly more harmful than it actually is.
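The same omitted-variable algebra can be checked numerically with the coefficients reported above: the gap between the two cigs estimates equals the faminc coefficient times the slope from regressing faminc on cigs.

```python
# Coefficients reported in the two regressions above
b_cigs_short = -0.513772   # from bwght ~ cigs
b_cigs_long = -0.463408    # from bwght ~ faminc + cigs
b_faminc = 0.092765        # from bwght ~ faminc + cigs

bias = b_cigs_short - b_cigs_long    # negative: omitting faminc biases cigs downward
# Implied slope of the auxiliary regression faminc ~ cigs
implied_delta = bias / b_faminc
```

The implied slope is about -0.54, negative as expected given the correlation of -0.17 computed above, so the direction of the bias is consistent with the discussion.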
C2
wooldridge.data("hprice1", description=True)
name of dataset: hprice1
no of variables: 10
no of observations: 88
+----------+------------------------------+
| variable | label |
+----------+------------------------------+
| price | house price, $1000s |
| assess | assessed value, $1000s |
| bdrms | number of bdrms |
| lotsize | size of lot in square feet |
| sqrft | size of house in square feet |
| colonial | =1 if home is colonial style |
| lprice | log(price) |
| lassess | log(assess) |
| llotsize | log(lotsize) |
| lsqrft | log(sqrft) |
+----------+------------------------------+
Collected from the real estate pages of the Boston Globe during 1990.
These are homes that sold in the Boston, MA area.
df = wooldridge.data("hprice1")
df.head()
(i) Write out the results in equation form.
model = smf.ols(formula="price ~ sqrft + bdrms", data=df).fit()
model.params, model.nobs, model.rsquared
(Intercept   -19.314996
sqrft          0.128436
bdrms         15.198191
dtype: float64,
88.0,
np.float64(0.6319184011671723))
(ii) What is the estimated increase in price for a house with one more bedroom, holding square footage constant?
One more bedroom increases the predicted price by about $15,200, holding square footage fixed (price is measured in thousands of dollars).
(iii) What is the estimated increase in price for a house with an additional bedroom that is 140 square feet in size? Compare this to your answer in part (ii).
model.params["sqrft"] * 140 + model.params["bdrms"] * 1
np.float64(33.17926041938293)
An additional 140-square-foot bedroom increases the predicted price by about $33,179, more than twice the effect in part (ii), because it adds both a bedroom and square footage.
(iv) What percentage of the variation in price is explained by square footage and number of bedrooms?
About 63.2%
(v) The first house in the sample has sqrft = 2,438 and bdrms= 4. Find the predicted selling price for this house from the OLS regression line.
model.predict(df.iloc[0])
0    354.605249
dtype: float64
print(
    model.params["Intercept"] + model.params["sqrft"] * 2438 + model.params["bdrms"] * 4
)
354.6052489839994
The predicted price is $354,605.
(vi) The actual selling price of the first house in the sample was $300,000 (so price = 300). Find the residual for this house. Does it suggest that the buyer underpaid or overpaid for the house?
df.loc[0]["price"] - model.predict(df.iloc[0])
0   -54.605249
dtype: float64
The residual is about -54.605, which suggests the buyer underpaid for the house by roughly $54,605 relative to the price its characteristics predict.
C3
The file CEOSAL2 contains data on 177 chief executive officers and can be used to examine the effects of firm performance on CEO salary.
wooldridge.data("ceosal2", description=True)
name of dataset: ceosal2
no of variables: 15
no of observations: 177
+----------+--------------------------------+
| variable | label |
+----------+--------------------------------+
| salary | 1990 compensation, $1000s |
| age | in years |
| college | =1 if attended college |
| grad | =1 if attended graduate school |
| comten | years with company |
| ceoten | years as ceo with company |
| sales | 1990 firm sales, millions |
| profits | 1990 profits, millions |
| mktval | market value, end 1990, mills. |
| lsalary | log(salary) |
| lsales | log(sales) |
| lmktval | log(mktval) |
| comtensq | comten^2 |
| ceotensq | ceoten^2 |
| profmarg | profits as % of sales |
+----------+--------------------------------+
See CEOSAL1.RAW
df = wooldridge.data("ceosal2")
df.head()
(i) Estimate a model relating annual salary to firm sales and market value. Make the model of the constant elasticity variety for both independent variables. Write the results out in equation form.
model = smf.ols(
    formula="np.log(salary) ~ np.log(sales) + np.log(mktval)", data=df
).fit()
model.params, model.nobs, model.rsquared
(Intercept        4.620917
np.log(sales)     0.162128
np.log(mktval)    0.106708
dtype: float64,
177.0,
np.float64(0.29911361599869846))
(ii) Add profits to the model from part (i). Why can this variable not be included in logarithmic form? Would you say that these firm performance variables explain most of the variation in CEO salaries?
df["profits"].describe()
count     177.000000
mean      207.830508
std       404.454296
min      -463.000000
25%        34.000000
50%        63.000000
75%       208.000000
max      2700.000000
Name: profits, dtype: float64
model2 = smf.ols(
    formula="np.log(salary) ~ np.log(sales) + np.log(mktval) + np.log(profits)",
    data=df,
).fit()
model2.params, model2.nobs, model2.rsquared
RuntimeWarning: invalid value encountered in log
(Intercept          4.211870
np.log(sales)       0.211419
np.log(mktval)      0.175385
np.log(profits)    -0.102808
dtype: float64,
168.0,
np.float64(0.3388726790440221))
Profits cannot be included in logarithmic form because the variable contains negative values (the minimum is -463). Taking logs of those observations produces the RuntimeWarning above and drops nine firms from the sample (177 down to 168).
model3 = smf.ols(
    formula="np.log(salary) ~ np.log(sales) + np.log(mktval) + profits", data=df
).fit()
model3.params, model3.nobs, model3.rsquared
(Intercept        4.686924
np.log(sales)     0.161368
np.log(mktval)    0.097529
profits           0.000036
dtype: float64,
177.0,
np.float64(0.2993366493504458))
model3.params["profits"] * 1000
np.float64(0.0356600694005661)
If profits increase by $1 billion (1,000 millions), salary is expected to increase by approximately 3.6%, ceteris paribus.
The model explains only around 30% of the sample variation in log(salary), so these firm performance variables do not explain most of it.
(iii) Add the variable ceoten to the model in part (ii). What is the estimated percentage return for another year of CEO tenure, holding other factors fixed?
model4 = smf.ols(
    formula="np.log(salary) ~ np.log(sales) + np.log(mktval) + profits + ceoten",
    data=df,
).fit()
model4.params, model4.nobs, model4.rsquared
(Intercept        4.557780
np.log(sales)     0.162234
np.log(mktval)    0.101760
profits           0.000029
ceoten            0.011685
dtype: float64,
177.0,
np.float64(0.3182987842099966))
model4.params["ceoten"] * 100
np.float64(1.1684668422563895)
One additional year as CEO is expected to increase salary by approximately 1.17%, holding other factors fixed.
(iv) Find the sample correlation coefficient between the variables log(mktval) and profits. Are these variables highly correlated? What does this say about the OLS estimators?
df["profits"].corr(np.log(df["mktval"]))
np.float64(0.7768975970597894)
The correlation is roughly 0.78, which indicates substantial multicollinearity. Multicollinearity does not bias the OLS estimators, but it does increase their variances.
C4
Use the data in ATTEND for this exercise.
wooldridge.data("attend", description=True)
name of dataset: attend
no of variables: 11
no of observations: 680
+----------+------------------------------+
| variable | label |
+----------+------------------------------+
| attend | classes attended out of 32 |
| termGPA | GPA for term |
| priGPA | cumulative GPA prior to term |
| ACT | ACT score |
| final | final exam score |
| atndrte | percent classes attended |
| hwrte | percent homework turned in |
| frosh | =1 if freshman |
| soph | =1 if sophomore |
| missed | number of classes missed |
| stndfnl | (final - mean)/sd |
+----------+------------------------------+
These data were collected by Professors Ronald Fisher and Carl
Liedholm during a term in which they both taught principles of
microeconomics at Michigan State University. Professors Fisher and
Liedholm kindly gave me permission to use a random subset of their
data, and their research assistant at the time, Jeffrey Guilfoyle, who
completed his Ph.D. in economics at MSU, provided helpful hints.
df = wooldridge.data("attend")
df.head()
(i) Obtain the minimum, maximum, and average values for the variables atndrte, priGPA, and ACT.
df[["atndrte", "priGPA", "ACT"]].agg(["min", "max", "mean"])
(ii) Estimate the model
atndrte = β0 + β1*priGPA + β2*ACT + u
and write the results in equation form. Interpret the intercept. Does it have a useful meaning?
model = smf.ols("atndrte ~ priGPA + ACT", data=df).fit()
model.params, model.nobs, model.rsquared
(Intercept    75.700405
priGPA        17.260591
ACT           -1.716553
dtype: float64,
680.0,
np.float64(0.29058148401574935))
The intercept indicates that a student with a prior GPA of zero and an ACT score of zero is predicted to have an attendance rate of 75.7%. This estimate is not useful, however, as it extrapolates beyond the range of the observed data: no students in the sample have both a zero GPA and a zero ACT score.
(iii) Discuss the estimated slope coefficients. Are there any surprises?
df["priGPA"].corr(df["ACT"])
np.float64(0.3537693606849367)
A one-point increase in priGPA increases the attendance rate by 17.3 percentage points, holding ACT fixed.
A one-point increase in ACT decreases the attendance rate by 1.7 percentage points, holding prior GPA fixed. The negative relation is surprising; it may reflect that students with high ACT scores believe they can pass without attending lectures.
(iv) What is the predicted atndrte if priGPA = 3.65 and ACT = 20? What do you make of this result? Are there any students in the sample with these values of the explanatory variables?
new_data = pd.DataFrame({"priGPA": [3.65], "ACT": [20]})
model.predict(new_data)
0    104.370504
dtype: float64
tolerance = 0.1
df.loc[
    (df["priGPA"].sub(3.65).abs() <= tolerance) & (df["ACT"].sub(20).abs() <= tolerance)
]
A student with a prior GPA of 3.65 and an ACT score of 20 is predicted to have a 104% attendance rate. Since the attendance rate is bounded between 0 and 100%, we would predict 100% instead.
There is one such student in the sample, with an actual attendance rate of 87.5%.
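Because a linear model's fitted values are unbounded, out-of-range predictions like this one can be truncated to the valid interval. A minimal sketch using the predicted value above:

```python
import numpy as np

pred = np.array([104.370504])       # predicted atndrte from the model above
clipped = np.clip(pred, 0, 100)     # enforce the 0-100% bound on attendance rates
```

This is only a post-hoc fix; models designed for bounded outcomes (e.g., fractional response models) address the problem directly.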
(v) If Student A has priGPA = 3.1 and ACT = 21 and Student B has priGPA = 2.1 and ACT = 26, what is the predicted difference in their attendance rates?
new_data = pd.DataFrame({"priGPA": [3.1, 2.1], "ACT": [21, 26]})
y_hat = model.predict(new_data)
y_hat[0] - y_hat[1]
np.float64(25.843355740233946)
The predicted difference in attendance rates is about 25.84 percentage points; Student A is predicted to attend substantially more often.
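As a check, the same difference can be computed by hand from the slope coefficients reported above, since the intercept cancels when differencing two predictions:

```python
# Slope coefficients from the fitted attendance model above
b_priGPA = 17.260591
b_ACT = -1.716553

# Student A minus Student B: the intercept cancels in the difference
diff = b_priGPA * (3.1 - 2.1) + b_ACT * (21 - 26)
```

Both the higher prior GPA (+17.26 points) and the lower ACT score (+8.58 points, given the negative ACT coefficient) push Student A's predicted attendance above Student B's.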