import pandas as pd
import numpy as np
import wooldridge
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from statsmodels.iolib.summary2 import summary_col
from sklearn.metrics import accuracy_score
import re

wooldridge.data()

J.M. Wooldridge (2019) Introductory Econometrics: A Modern Approach,
Cengage Learning, 6th edition.
401k 401ksubs admnrev affairs airfare
alcohol apple approval athlet1 athlet2
attend audit barium beauty benefits
beveridge big9salary bwght bwght2 campus
card catholic cement census2000 ceosal1
ceosal2 charity consump corn countymurders
cps78_85 cps91 crime1 crime2 crime3
crime4 discrim driving earns econmath
elem94_95 engin expendshares ezanders ezunem
fair fertil1 fertil2 fertil3 fish
fringe gpa1 gpa2 gpa3 happiness
hprice1 hprice2 hprice3 hseinv htv
infmrt injury intdef intqrt inven
jtrain jtrain2 jtrain3 kielmc lawsch85
loanapp lowbrth mathpnl meap00_01 meap01
meap93 meapsingle minwage mlb1 mroz
murder nbasal nyse okun openness
pension phillips pntsprd prison prminwge
rdchem rdtelec recid rental return
saving sleep75 slp75_81 smoke traffic1
traffic2 twoyear volat vote1 vote2
voucher wage1 wage2 wagepan wageprc
wine
Examples

17.1 Married Women's Labor Force Participation

mroz = wooldridge.data('mroz')
wooldridge.data('mroz', description=True)

name of dataset: mroz
no of variables: 22
no of observations: 753
+----------+---------------------------------+
| variable | label |
+----------+---------------------------------+
| inlf | =1 if in lab frce, 1975 |
| hours | hours worked, 1975 |
| kidslt6 | # kids < 6 years |
| kidsge6 | # kids 6-18 |
| age | woman's age in yrs |
| educ | years of schooling |
| wage | est. wage from earn, hrs |
| repwage | rep. wage at interview in 1976 |
| hushrs | hours worked by husband, 1975 |
| husage | husband's age |
| huseduc | husband's years of schooling |
| huswage | husband's hourly wage, 1975 |
| faminc | family income, 1975 |
| mtr | fed. marg. tax rte facing woman |
| motheduc | mother's years of schooling |
| fatheduc | father's years of schooling |
| unem | unem. rate in county of resid. |
| city | =1 if live in SMSA |
| exper | actual labor mkt exper |
| nwifeinc | (faminc - wage*hours)/1000 |
| lwage | log(wage) |
| expersq | exper^2 |
+----------+---------------------------------+
T.A. Mroz (1987), “The Sensitivity of an Empirical Model of Married
Women’s Hours of Work to Economic and Statistical Assumptions,”
Econometrica 55, 765-799. Professor Ernst R. Berndt, of MIT, kindly
provided the data, which he obtained from Professor Mroz.
Comparing the linear probability model (LPM), the logit model, and the probit model:
model01 = smf.ols('inlf ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6', data = mroz).fit(cov_type='HC3')
print(model01.summary2().tables[1].iloc[:,:4])
print(model01.rsquared)

              Coef.  Std.Err.         z         P>|z|
Intercept 0.585519 0.153580 3.812463 1.375890e-04
nwifeinc -0.003405 0.001558 -2.185242 2.887114e-02
educ 0.037995 0.007340 5.176602 2.259636e-07
exper 0.039492 0.005984 6.600115 4.108390e-11
expersq -0.000596 0.000199 -2.997303 2.723797e-03
age -0.016091 0.002415 -6.664002 2.664705e-11
kidslt6 -0.261810 0.032152 -8.143000 3.856034e-16
kidsge6 0.013012 0.013660 0.952558 3.408143e-01
0.26421615770495754
model012 = smf.logit('inlf ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6', data = mroz).fit()
print(model012.summary2().tables[1].iloc[:,:4])
print(logit_marg := model012.get_margeff().summary())
print(model012.prsquared)

Optimization terminated successfully.
Current function value: 0.533553
Iterations 6
Coef. Std.Err. z P>|z|
Intercept 0.425452 0.860370 0.494499 6.209535e-01
nwifeinc -0.021345 0.008421 -2.534620 1.125693e-02
educ 0.221170 0.043440 5.091442 3.553503e-07
exper 0.205870 0.032057 6.422001 1.344946e-10
expersq -0.003154 0.001016 -3.104093 1.908635e-03
age -0.088024 0.014573 -6.040232 1.538930e-09
kidslt6 -1.443354 0.203585 -7.089692 1.344105e-12
kidsge6 0.060112 0.074790 0.803749 4.215417e-01
Logit Marginal Effects
=====================================
Dep. Variable: inlf
Method: dydx
At: overall
==============================================================================
dy/dx std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
nwifeinc -0.0038 0.001 -2.571 0.010 -0.007 -0.001
educ 0.0395 0.007 5.414 0.000 0.025 0.054
exper 0.0368 0.005 7.139 0.000 0.027 0.047
expersq -0.0006 0.000 -3.176 0.001 -0.001 -0.000
age -0.0157 0.002 -6.603 0.000 -0.020 -0.011
kidslt6 -0.2578 0.032 -8.070 0.000 -0.320 -0.195
kidsge6 0.0107 0.013 0.805 0.421 -0.015 0.037
==============================================================================
0.21968137484058814
model013 = smf.probit('inlf ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6', data = mroz).fit()
print(model013.summary2().tables[1].iloc[:,:4])
print(probit_marg := model013.get_margeff().summary())
print(model013.prsquared)

Optimization terminated successfully.
Current function value: 0.532938
Iterations 5
Coef. Std.Err. z P>|z|
Intercept 0.270077 0.508593 0.531027 5.953999e-01
nwifeinc -0.012024 0.004840 -2.484327 1.297967e-02
educ 0.130905 0.025254 5.183485 2.177783e-07
exper 0.123348 0.018716 6.590348 4.387977e-11
expersq -0.001887 0.000600 -3.145205 1.659704e-03
age -0.052853 0.008477 -6.234656 4.527723e-10
kidslt6 -0.868329 0.118522 -7.326287 2.366161e-13
kidsge6 0.036005 0.043477 0.828142 4.075900e-01
Probit Marginal Effects
=====================================
Dep. Variable: inlf
Method: dydx
At: overall
==============================================================================
dy/dx std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
nwifeinc -0.0036 0.001 -2.509 0.012 -0.006 -0.001
educ 0.0394 0.007 5.452 0.000 0.025 0.054
exper 0.0371 0.005 7.200 0.000 0.027 0.047
expersq -0.0006 0.000 -3.205 0.001 -0.001 -0.000
age -0.0159 0.002 -6.739 0.000 -0.021 -0.011
kidslt6 -0.2612 0.032 -8.197 0.000 -0.324 -0.199
kidsge6 0.0108 0.013 0.829 0.407 -0.015 0.036
==============================================================================
0.2205805437252938
To get the percentage predicted correctly:
y = mroz['inlf']
acc_lpm = accuracy_score(y, (model01.predict(mroz) >= 0.5).astype(int))
acc_logit = accuracy_score(y, (model012.predict(mroz) >= 0.5).astype(int))
acc_probit = accuracy_score(y, (model013.predict(mroz) >= 0.5).astype(int))
acc_lpm, acc_logit, acc_probit

(0.7343957503320053, 0.7357237715803453, 0.7343957503320053)

# Generate table
results_table = summary_col(
[model01, model012, model013],
stars=True,
float_format="%.3f",
model_names=["LPM (OLS)", "Logit (MLE)", "Probit (MLE)"],
info_dict={
"Fit (R² or Pseudo R²)": lambda x: f"{x.rsquared:.3f}" if hasattr(x, "rsquared") else f"{x.prsquared:.3f}",
"Log-Lik": lambda x: f"{x.llf:.2f}" if hasattr(x, "llf") else "",
"% Correct": lambda x: f"{100 * accuracy_score(y, (x.predict(mroz) >= 0.5).astype(int)):.1f}"
}
)
# Get string output and filter out unwanted lines
table_str = results_table.as_text()
filtered_str = "\n".join(
line for line in table_str.splitlines()
if not re.search(r"R-squared|Adj. R-squared", line)
)
# Print cleaned-up table
print(filtered_str)
========================================================
LPM (OLS) Logit (MLE) Probit (MLE)
--------------------------------------------------------
Intercept 0.586*** 0.425 0.270
(0.154) (0.860) (0.509)
nwifeinc -0.003** -0.021** -0.012**
(0.002) (0.008) (0.005)
educ 0.038*** 0.221*** 0.131***
(0.007) (0.043) (0.025)
exper 0.039*** 0.206*** 0.123***
(0.006) (0.032) (0.019)
expersq -0.001*** -0.003*** -0.002***
(0.000) (0.001) (0.001)
age -0.016*** -0.088*** -0.053***
(0.002) (0.015) (0.008)
kidslt6 -0.262*** -1.443*** -0.868***
(0.032) (0.204) (0.119)
kidsge6 0.013 0.060 0.036
(0.014) (0.075) (0.043)
% Correct 73.4 73.6 73.4
Fit (R² or Pseudo R²) 0.264 0.220 0.221
Log-Lik -423.89 -401.77 -401.30
========================================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
The three models agree on which variables are statistically significant, but their coefficients are not directly comparable. For the logit and probit models, compare marginal effects instead:
tab0 = model01.summary2().tables[1].iloc[1:,0].round(4)
tab0.name = "LPM (OLS)"
tab1 = logit_marg.tables[1]
tab1 = pd.DataFrame(tab1)
tab1 = tab1.iloc[1:,1]
tab1.index = tab0.index
tab1.name = "Logit (Marg Eff)"
tab2 = probit_marg.tables[1]
tab2 = pd.DataFrame(tab2)
tab2 = tab2.iloc[1:,1]
tab2.index = tab0.index
tab2.name = "Probit (Marg Eff)"
pd.concat([tab0, tab1, tab2], axis=1)

Interpreting logistic regression: coefficients and marginal effects
Coefficient of educ = 0.2212:
A one-year increase in education is associated with a 0.2212-unit increase in the log odds of a woman participating in the labor force.
APE of educ = 0.0395:
A one-year increase in education is associated with a 3.95 percentage-point increase in the probability of labor force participation.
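The log-odds reading can also be restated as an odds ratio; exponentiating the coefficient is all that is required. A minimal sketch (the coefficient value is copied from the logit output above):

```python
import math

# Logit coefficient on educ, copied (rounded) from the fitted model above
b_educ = 0.2212

# One extra year of education multiplies the odds of participation by exp(b)
odds_ratio = math.exp(b_educ)
print(round(odds_ratio, 3))
```

So each additional year of schooling multiplies the odds of participation by roughly 1.25 (about a 25% increase in the odds), holding the other regressors fixed.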
Interpreting kidslt6: LPM vs. probit

In the LPM, having one more child under 6 is estimated to reduce the probability of labor force participation by 26.2 percentage points (0.262), regardless of how many young children the woman already has.
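This constancy can be checked by hand: under the LPM the predicted drop per additional young child is just the coefficient, whatever the starting count. A quick sketch using the OLS coefficients reported above and the same hypothetical characteristics as the probit scenario below:

```python
# LPM (OLS) coefficients, copied (rounded) from the output above
b = {'Intercept': 0.585519, 'nwifeinc': -0.003405, 'educ': 0.037995,
     'exper': 0.039492, 'expersq': -0.000596, 'age': -0.016091,
     'kidslt6': -0.261810, 'kidsge6': 0.013012}

# Hypothetical woman; everything fixed except the number of children under 6
x = {'nwifeinc': 20.13, 'educ': 12.3, 'exper': 10.6, 'expersq': 10.6 ** 2,
     'age': 42.5, 'kidsge6': 1}

base = b['Intercept'] + sum(b[v] * val for v, val in x.items())
probs = [base + b['kidslt6'] * k for k in (0, 1, 2)]   # fitted P(inlf=1)
drops = [probs[i + 1] - probs[i] for i in range(2)]     # change per extra child
print([round(p, 3) for p in probs], [round(d, 3) for d in drops])
```

Both drops equal the coefficient, about -0.262, whether the woman goes from 0 to 1 or from 1 to 2 young children.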
kid0 = {
'nwifeinc': 20.13,
'educ': 12.3,
'exper': 10.6,
'expersq': 10.6**2,
'age': 42.5,
'kidsge6': 1,
'kidslt6': 0
}
kid1 = kid0.copy()
kid1['kidslt6'] = 1
kid2 = kid0.copy()
kid2['kidslt6'] = 2

probs = model013.predict(pd.DataFrame([kid0, kid1, kid2]))
prob_lost1 = probs.iloc[1] - probs.iloc[0]
prob_lost2 = probs.iloc[2] - probs.iloc[1]
probs, prob_lost1, prob_lost2

(0    0.699647
 1    0.365069
 2    0.112513
 dtype: float64,
 np.float64(-0.33457838066203827),
 np.float64(-0.252555628152638))

The probit model evaluates the effect at the woman's actual characteristics. For this woman, going from 0 to 1 small child lowers the predicted probability of labor force participation by 0.335; going from 1 to 2 small children lowers it again by only 0.253. Unlike the LPM, the effect shrinks as children are added, because the probit response is nonlinear.
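These predicted probabilities can be reproduced by hand from the probit coefficients, which makes the source of the diminishing effect visible: each additional young child shifts the linear index by the same 0.868, but the normal CDF maps equal index shifts into unequal probability changes. A sketch using only the standard library (coefficients copied, rounded, from the probit output above):

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Probit coefficients, copied (rounded) from the fitted model above
b = {'Intercept': 0.270077, 'nwifeinc': -0.012024, 'educ': 0.130905,
     'exper': 0.123348, 'expersq': -0.001887, 'age': -0.052853,
     'kidslt6': -0.868329, 'kidsge6': 0.036005}

# Same hypothetical woman as in the prediction above
x = {'nwifeinc': 20.13, 'educ': 12.3, 'exper': 10.6, 'expersq': 10.6 ** 2,
     'age': 42.5, 'kidsge6': 1}

base = b['Intercept'] + sum(b[v] * val for v, val in x.items())
probs = [norm_cdf(base + b['kidslt6'] * k) for k in (0, 1, 2)]
print([round(p, 3) for p in probs])
```

The index falls by 0.868 per child in both steps, yet the probability falls first by about 0.335 and then by about 0.253, matching the model013 predictions above.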