import pandas as pd
import numpy as np
import wooldridge
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

def duan_smearing_estimator(model):
"""
Compute Duan's smearing estimate (sigma0) from a fitted log-linear model.
Parameters:
model: A fitted statsmodels OLS model where the dependent variable is log(y).
Returns:
float: Smearing estimate sigma0.
"""
resid = model.resid # residuals from log-linear model
sigma0 = np.mean(np.exp(resid)) # Duan's smearing estimate
return sigma0wooldridge.data() J.M. Wooldridge (2019) Introductory Econometrics: A Modern Approach,
Cengage Learning, 6th edition.
401k 401ksubs admnrev affairs airfare
alcohol apple approval athlet1 athlet2
attend audit barium beauty benefits
beveridge big9salary bwght bwght2 campus
card catholic cement census2000 ceosal1
ceosal2 charity consump corn countymurders
cps78_85 cps91 crime1 crime2 crime3
crime4 discrim driving earns econmath
elem94_95 engin expendshares ezanders ezunem
fair fertil1 fertil2 fertil3 fish
fringe gpa1 gpa2 gpa3 happiness
hprice1 hprice2 hprice3 hseinv htv
infmrt injury intdef intqrt inven
jtrain jtrain2 jtrain3 kielmc lawsch85
loanapp lowbrth mathpnl meap00_01 meap01
meap93 meapsingle minwage mlb1 mroz
murder nbasal nyse okun openness
pension phillips pntsprd prison prminwge
rdchem rdtelec recid rental return
saving sleep75 slp75_81 smoke traffic1
traffic2 twoyear volat vote1 vote2
voucher wage1 wage2 wagepan wageprc
wine
Examples¶
pd.options.display.float_format = "{:.4f}".format

hprice2 = wooldridge.data("hprice2")
attend = wooldridge.data("attend")
ceosal1 = wooldridge.data("ceosal1")
gpa2 = wooldridge.data("gpa2")
ceosal2 = wooldridge.data("ceosal2")

6.1 Effects of Pollution on Housing Prices¶
To obtain beta (standardized) coefficients, I use StandardScaler from the sklearn library.
wooldridge.data("hprice2", description=True)name of dataset: hprice2
no of variables: 12
no of observations: 506
+----------+-------------------------------+
| variable | label |
+----------+-------------------------------+
| price | median housing price, $ |
| crime | crimes committed per capita |
| nox | nit ox concen; parts per 100m |
| rooms | avg number of rooms |
| dist | wght dist to 5 employ centers |
| radial | access. index to rad. hghwys |
| proptax | property tax per $1000 |
| stratio | average student-teacher ratio |
| lowstat | perc of people 'lower status' |
| lprice | log(price) |
| lnox | log(nox) |
| lproptax | log(proptax) |
+----------+-------------------------------+
D. Harrison and D.L. Rubinfeld (1978), “Hedonic Housing Prices and the
Demand for Clean Air,” Journal of Environmental Economics and Management
5, 81-102. Diego Garcia, a former Ph.D. student in economics at MIT,
kindly provided these data, which he obtained from the book Regression
Diagnostics: Identifying Influential Data and Sources of Collinearity,
by D.A. Belsey, E. Kuh, and R. Welsch, 1990. New York: Wiley.
hprice2.head()

df = hprice2.copy()
vars_to_standardize = ["price", "nox", "crime", "rooms", "dist", "stratio"]
df[vars_to_standardize] = StandardScaler().fit_transform(df[vars_to_standardize])

df.head()

pd.set_option("display.float_format", "{:.3f}".format)
# pd.reset_option('display.float_format')

model01 = smf.ols("price ~ nox + crime + rooms + dist + stratio", data=df).fit()
model01.summary2().tables[1].iloc[:, :2]

A one-standard-deviation increase in nox decreases price by 0.34 standard deviations.
A one-standard-deviation increase in crime decreases price by 0.14 standard deviations.
So nox affects price more than crime.
The t statistics do not change when the data are standardized.
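A quick check of that last claim, on synthetic data rather than the hprice2 sample: standardization is just a linear rescaling of y and the regressors, so the slope t statistics are unchanged. (The variable names `x1`, `x2`, `y` here are made up for the sketch.)

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1 + 2 * df["x1"] - 0.5 * df["x2"] + rng.normal(size=n)

# fit on the raw data
raw = smf.ols("y ~ x1 + x2", data=df).fit()

# z-score every column by hand (population std, like StandardScaler's default)
z = (df - df.mean()) / df.std(ddof=0)
std = smf.ols("y ~ x1 + x2", data=z).fit()

# the slope t statistics agree to machine precision
print(raw.tvalues[["x1", "x2"]])
print(std.tvalues[["x1", "x2"]])
```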
6.2 Effects of Pollution on Housing Prices¶
model02 = smf.ols(
"lprice ~ lnox + np.log(dist) + rooms + I(rooms**2) + stratio", data=hprice2
).fit()
model02.summary2().tables[1].iloc[:, :4]

The quadratic term is statistically significant. The coefficient on rooms is negative and the coefficient on rooms squared is positive.
Rooms have a negative effect on price up to some point; after that point the effect becomes positive. This turnaround point is |beta_rooms| / (2 * beta_rooms_squared).
That price decreases with rooms at three rooms or fewer is not practically meaningful; the negative range reflects the few observations in the sample with so few rooms.
Note: the effect of rooms depends on the number of rooms, so examining the coefficients alone is not enough.
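The turnaround formula can be sketched directly from a fitted quadratic. This uses synthetic data shaped like the rooms example (negative linear term, positive squared term), not the actual hprice2 coefficients:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"rooms": rng.uniform(3, 9, size=500)})
# true turnaround point: 0.5 / (2 * 0.06) ≈ 4.17
df["lprice"] = (10 - 0.5 * df["rooms"] + 0.06 * df["rooms"] ** 2
                + rng.normal(scale=0.05, size=500))

fit = smf.ols("lprice ~ rooms + I(rooms**2)", data=df).fit()
b1 = fit.params["rooms"]
b2 = fit.params["I(rooms ** 2)"]  # patsy names the quadratic term this way

turning_point = -b1 / (2 * b2)  # x* = -b1 / (2*b2)
print(round(turning_point, 2))
```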
6.3 Effects of Attendance on Final Exam Performance¶
wooldridge.data("attend", description=True)name of dataset: attend
no of variables: 11
no of observations: 680
+----------+------------------------------+
| variable | label |
+----------+------------------------------+
| attend | classes attended out of 32 |
| termGPA | GPA for term |
| priGPA | cumulative GPA prior to term |
| ACT | ACT score |
| final | final exam score |
| atndrte | percent classes attended |
| hwrte | percent homework turned in |
| frosh | =1 if freshman |
| soph | =1 if sophomore |
| missed | number of classes missed |
| stndfnl | (final - mean)/sd |
+----------+------------------------------+
These data were collected by Professors Ronald Fisher and Carl
Liedholm during a term in which they both taught principles of
microeconomics at Michigan State University. Professors Fisher and
Liedholm kindly gave me permission to use a random subset of their
data, and their research assistant at the time, Jeffrey Guilfoyle, who
completed his Ph.D. in economics at MSU, provided helpful hints.
model03 = smf.ols(
"stndfnl ~ atndrte + priGPA + ACT + " "I(priGPA**2)+ I(ACT**2)+ I(priGPA*atndrte)",
data=attend,
).fit()
model03.summary2().tables[1].iloc[:, :4]

When priGPA is small enough, the estimated effect of attendance on exam score is negative.
The individual t statistics are misleading here. Check the F test:
model03.params.index

Index(['Intercept', 'atndrte', 'priGPA', 'ACT', 'I(priGPA ** 2)',
       'I(ACT ** 2)', 'I(priGPA * atndrte)'],
      dtype='object')

model03.f_test("atndrte = 0, I(priGPA * atndrte) = 0")

<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=4.319021763197198, p=0.013683848684002065, df_denom=673, df_num=2>

To estimate the partial effect of atndrte, plug a value for priGPA into the interaction term:
attend["priGPA"].mean()

np.float64(2.586774999078582)

At the mean priGPA of about 2.59, the effect of atndrte is roughly -0.0067 + 0.0056 * 2.59 ≈ 0.0078.
If the attendance rate increases by 10 percentage points, the final exam score increases by 0.078 standard deviations.
To test whether the 0.0078 is significantly different from zero, re-center priGPA inside the interaction term and re-estimate:
model032 = smf.ols(
"stndfnl ~ atndrte + priGPA + ACT + "
"I(priGPA**2)+ I(ACT**2)+ I((priGPA-2.59)*atndrte)",
data=attend,
).fit()
model032.summary2().tables[1].iloc[:, :4]

With this re-centering, the coefficient on atndrte becomes our calculated 0.0078, with a p value of 0.0034: statistically significant.
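An equivalent route, sketched here on synthetic data (the coefficients and the 2.59 below are illustrative, not the attend estimates): instead of re-centering and re-estimating, test the linear combination b_atndrte + 2.59 * b_interaction directly with t_test, which also returns its standard error.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 400
df = pd.DataFrame({"atndrte": rng.uniform(20, 100, n),
                   "priGPA": rng.uniform(1, 4, n)})
# true partial effect of atndrte at priGPA = 2.59: -0.007 + 0.006*2.59 ≈ 0.0085
df["stndfnl"] = (-0.007 * df["atndrte"]
                 + 0.006 * df["priGPA"] * df["atndrte"]
                 + rng.normal(size=n))

fit = smf.ols("stndfnl ~ atndrte + priGPA + I(priGPA * atndrte)", data=df).fit()

# test the linear combination at priGPA = 2.59 in one call
tt = fit.t_test("atndrte + 2.59 * I(priGPA * atndrte) = 0")
print(tt)
```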
6.4 CEO Compensation and Firm Performance¶
wooldridge.data("ceosal1", description=True)name of dataset: ceosal1
no of variables: 12
no of observations: 209
+----------+-------------------------------+
| variable | label |
+----------+-------------------------------+
| salary | 1990 salary, thousands $ |
| pcsalary | % change salary, 89-90 |
| sales | 1990 firm sales, millions $ |
| roe | return on equity, 88-90 avg |
| pcroe | % change roe, 88-90 |
| ros | return on firm's stock, 88-90 |
| indus | =1 if industrial firm |
| finance | =1 if financial firm |
| consprod | =1 if consumer product firm |
| utility  | =1 if transport. or utilities |
| lsalary | natural log of salary |
| lsales | natural log of sales |
+----------+-------------------------------+
I took a random sample of data reported in the May 6, 1991 issue of
Businessweek.
model04 = smf.ols("salary ~ sales + roe", data=ceosal1).fit()
model04.rsquared, model04.rsquared_adj

(np.float64(0.029171686607148084), np.float64(0.019746169001392255))

model042 = smf.ols("lsalary ~ lsales + roe", data=ceosal1).fit()
model042.rsquared, model042.rsquared_adj

(np.float64(0.28198874478633673), np.float64(0.2750177617260099))

The second model has a higher R-squared and adjusted R-squared because the variation in the dependent variable is lower after taking the log:
y = model04.model.endog
y_bar = np.mean(y)
TSS = np.sum((y - y_bar) ** 2)

y = model042.model.endog
y_bar = np.mean(y)
TSS2 = np.sum((y - y_bar) ** 2)

TSS, TSS2

(np.float64(391732982.00956935), np.float64(66.72216321299874))

6.5 Confidence Interval for Predicted College GPA¶
wooldridge.data("gpa2", description=True)name of dataset: gpa2
no of variables: 12
no of observations: 4137
+----------+----------------------------------+
| variable | label |
+----------+----------------------------------+
| sat | combined SAT score |
| tothrs | total hours through fall semest |
| colgpa | GPA after fall semester |
| athlete | =1 if athlete |
| verbmath | verbal/math SAT score |
| hsize | size grad. class, 100s |
| hsrank | rank in grad. class |
| hsperc | high school percentile, from top |
| female | =1 if female |
| white | =1 if white |
| black | =1 if black |
| hsizesq | hsize^2 |
+----------+----------------------------------+
For confidentiality reasons, I cannot provide the source of these
data. I can say that they come from a midsize research university
that also supports men’s and women’s athletics at the Division I
level.
model05 = smf.ols("colgpa ~ sat + hsperc + hsize + hsizesq", data=gpa2).fit()
model05.summary2().tables[1].iloc[:, :2]

model05.summary2().tables[0]

se = model05.mse_resid**0.5
se

np.float64(0.5598638444189318)

Use the model to predict GPA at chosen values of the independent variables:
new_data = pd.DataFrame({"sat": [1200], "hsperc": [30], "hsize": [5], "hsizesq": [25]})

predicted_gpa = model05.predict(new_data)
predicted_gpa

0   2.700
dtype: float64

To calculate a confidence interval for our predicted 2.7, subtract the values used for the prediction from the independent variables:
df = gpa2.copy()
df["sat"] = gpa2["sat"] - 1200
df["hsize"] = gpa2["hsize"] - 5
df["hsperc"] = gpa2["hsperc"] - 30
df["hsizesq"] = gpa2["hsizesq"] - 25model052 = smf.ols("colgpa ~ sat + hsperc + hsize + hsizesq", data=df).fit()
model052.summary2().tables[1].iloc[:]The model resulted in same coefficients as before except for the intercept value. The confidence interval for the prediction is [2.66,2.74]
from statsmodels.iolib.summary2 import summary_col
summary = summary_col(
[model05, model052],
stars=True,
model_names=["Uncentered", "Centered"],
info_dict={
"R-squared": lambda x: f"{x.rsquared:.4f}",
"No. observations": lambda x: f"{int(x.nobs)}",
},
)
print(summary)
======================================
Uncentered Centered
--------------------------------------
Intercept 1.4927*** 2.7001***
(0.0753) (0.0199)
sat 0.0015*** 0.0015***
(0.0001) (0.0001)
hsperc -0.0139*** -0.0139***
(0.0006) (0.0006)
hsize -0.0609*** -0.0609***
(0.0165) (0.0165)
hsizesq 0.0055** 0.0055**
(0.0023) (0.0023)
R-squared 0.2781 0.2781
R-squared Adj. 0.2774 0.2774
No. observations 4137 4137
R-squared 0.2781 0.2781
======================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
Note: centering hsize and then squaring, (hsize - 5)**2, gives a different value than squaring and then centering, hsize**2 - 25.
The correct order here is to square first and then center, which matches how hsizesq enters the model.
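A quick numeric check of that note, with a made-up toy array (not the gpa2 data):

```python
import numpy as np

hsize = np.array([2.0, 5.0, 8.0])
center_then_square = (hsize - 5) ** 2   # squares the centered variable
square_then_center = hsize ** 2 - 25    # centers the squared variable (what the example does)
print(center_then_square)   # [9. 0. 9.]
print(square_then_center)   # [-21.  0.  39.]
```

The two disagree everywhere except at hsize = 5, so only the square-then-center version leaves the slope coefficients untouched.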
6.6 Confidence Interval for Future College GPA¶
To get prediction confidence intervals
model06 = smf.ols("colgpa ~ sat + hsperc + hsize + hsizesq", data=gpa2).fit()
model06.summary2().tables[1].iloc[:, :2]

new_data = pd.DataFrame({"sat": [1200], "hsperc": [30], "hsize": [5], "hsizesq": [25]})

pred = model06.get_prediction(new_data)
pred.summary_frame()

The confidence interval for the prediction is [1.6, 3.8]. It is wider because it accounts for individual-specific (residual) variation due to unobserved factors.
Note: Example 6.5 computed a confidence interval for the average college GPA of students with the given characteristics.
Example 6.6 added the residual variation to get an interval for an individual student's GPA.
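The distinction shows up directly in summary_frame, sketched here on synthetic data (a made-up one-regressor model, not gpa2): mean_ci is the interval for the average response (Example 6.5), obs_ci the wider interval for an individual outcome (Example 6.6).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.uniform(0, 10, 300)})
df["y"] = 1 + 0.5 * df["x"] + rng.normal(scale=1.0, size=300)

fit = smf.ols("y ~ x", data=df).fit()
frame = fit.get_prediction(pd.DataFrame({"x": [5.0]})).summary_frame()

# interval for the mean response vs. interval for a single new observation
mean_width = frame["mean_ci_upper"] - frame["mean_ci_lower"]
obs_width = frame["obs_ci_upper"] - frame["obs_ci_lower"]
print(frame.round(3))
```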
6.7 Predicting CEO Salaries¶
wooldridge.data("ceosal2", description=True)name of dataset: ceosal2
no of variables: 15
no of observations: 177
+----------+--------------------------------+
| variable | label |
+----------+--------------------------------+
| salary | 1990 compensation, $1000s |
| age | in years |
| college | =1 if attended college |
| grad | =1 if attended graduate school |
| comten | years with company |
| ceoten | years as ceo with company |
| sales | 1990 firm sales, millions |
| profits | 1990 profits, millions |
| mktval | market value, end 1990, mills. |
| lsalary | log(salary) |
| lsales | log(sales) |
| lmktval | log(mktval) |
| comtensq | comten^2 |
| ceotensq | ceoten^2 |
| profmarg | profits as % of sales |
+----------+--------------------------------+
See CEOSAL1.RAW
model07 = smf.ols("lsalary ~ lsales + lmktval + ceoten", data=ceosal2).fit()
model07.summary2().tables[1].iloc[:, :2]

resid = model07.resid
n = model07.nobs

The Duan smearing estimate is the average of the exponentiated residuals:
sigma0 = np.sum(np.exp(resid)) / n
sigma0

np.float64(1.1356613266630768)

duan_smearing_estimator(model07)

np.float64(1.1356613266630768)

The prediction for lsales = log(5000), lmktval = log(10000), and ceoten = 10:
new_data = pd.DataFrame(
{"lsales": [np.log(5000)], "lmktval": [np.log(10000)], "ceoten": [10]}
)

prediction = model07.predict(new_data)
prediction

0   7.014
dtype: float64

The predicted salary is the smearing estimate times the exponentiated log prediction:
np.exp(prediction) * sigma0

0   1263.059
dtype: float64

6.8 Predicting CEO Salaries¶
model08 = smf.ols("lsalary ~ lsales + lmktval + ceoten", data=ceosal2).fit()
model08.summary2().tables[1].iloc[:, :2]

To get predicted salary from our log model, take the exponential of the fitted values.
Call the result mhat to avoid confusing it with the fitted value predicted from the level model.
mhat = np.exp(model08.predict())
df = ceosal2.copy()
df["mhat"] = mhatThen get the correlation between it and y and square it
corr = df["mhat"].corr(df["salary"])
np.square(corr)np.float64(0.2430807795867979)The log model explains 24.3% of the variation in
model082 = smf.ols("salary ~ sales + mktval + ceoten", data=ceosal2).fit()
model082.rsquarednp.float64(0.20127439139676495)The level model explains of the variation in while the log model explains of the variation in The log model is preferred to the level model