
Limited Dependent Variable Models and Sample Selection Corrections

import pandas as pd
import numpy as np
import wooldridge
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from statsmodels.iolib.summary2 import summary_col
from sklearn.metrics import accuracy_score
import re
wooldridge.data()
  J.M. Wooldridge (2019) Introductory Econometrics: A Modern Approach,
  Cengage Learning, 6th edition.

  401k       401ksubs    admnrev       affairs     airfare
  alcohol    apple       approval      athlet1     athlet2
  attend     audit       barium        beauty      benefits
  beveridge  big9salary  bwght         bwght2      campus
  card       catholic    cement        census2000  ceosal1
  ceosal2    charity     consump       corn        countymurders
  cps78_85   cps91       crime1        crime2      crime3
  crime4     discrim     driving       earns       econmath
  elem94_95  engin       expendshares  ezanders    ezunem
  fair       fertil1     fertil2       fertil3     fish
  fringe     gpa1        gpa2          gpa3        happiness
  hprice1    hprice2     hprice3       hseinv      htv
  infmrt     injury      intdef        intqrt      inven
  jtrain     jtrain2     jtrain3       kielmc      lawsch85
  loanapp    lowbrth     mathpnl       meap00_01   meap01
  meap93     meapsingle  minwage       mlb1        mroz
  murder     nbasal      nyse          okun        openness
  pension    phillips    pntsprd       prison      prminwge
  rdchem     rdtelec     recid         rental      return
  saving     sleep75     slp75_81      smoke       traffic1
  traffic2   twoyear     volat         vote1       vote2
  voucher    wage1       wage2         wagepan     wageprc
  wine

Examples

mroz = wooldridge.data('mroz')

17.1 Married Women’s Labor Force Participation

wooldridge.data('mroz', description=True)
name of dataset: mroz
no of variables: 22
no of observations: 753

+----------+---------------------------------+
| variable | label                           |
+----------+---------------------------------+
| inlf     | =1 if in lab frce, 1975         |
| hours    | hours worked, 1975              |
| kidslt6  | # kids < 6 years                |
| kidsge6  | # kids 6-18                     |
| age      | woman's age in yrs              |
| educ     | years of schooling              |
| wage     | est. wage from earn, hrs        |
| repwage  | rep. wage at interview in 1976  |
| hushrs   | hours worked by husband, 1975   |
| husage   | husband's age                   |
| huseduc  | husband's years of schooling    |
| huswage  | husband's hourly wage, 1975     |
| faminc   | family income, 1975             |
| mtr      | fed. marg. tax rte facing woman |
| motheduc | mother's years of schooling     |
| fatheduc | father's years of schooling     |
| unem     | unem. rate in county of resid.  |
| city     | =1 if live in SMSA              |
| exper    | actual labor mkt exper          |
| nwifeinc | (faminc - wage*hours)/1000      |
| lwage    | log(wage)                       |
| expersq  | exper^2                         |
+----------+---------------------------------+

T.A. Mroz (1987), “The Sensitivity of an Empirical Model of Married
Women’s Hours of Work to Economic and Statistical Assumptions,”
Econometrica 55, 765-799. Professor Ernst R. Berndt, of MIT, kindly
provided the data, which he obtained from Professor Mroz.

Comparing the linear probability model (LPM), the logit model, and the probit model:

model01 = smf.ols('inlf ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6', data = mroz).fit(cov_type='HC3')
print(model01.summary2().tables[1].iloc[:,:4])
print(model01.rsquared)
              Coef.  Std.Err.         z         P>|z|
Intercept  0.585519  0.153580  3.812463  1.375890e-04
nwifeinc  -0.003405  0.001558 -2.185242  2.887114e-02
educ       0.037995  0.007340  5.176602  2.259636e-07
exper      0.039492  0.005984  6.600115  4.108390e-11
expersq   -0.000596  0.000199 -2.997303  2.723797e-03
age       -0.016091  0.002415 -6.664002  2.664705e-11
kidslt6   -0.261810  0.032152 -8.143000  3.856034e-16
kidsge6    0.013012  0.013660  0.952558  3.408143e-01
0.26421615770495754
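One reason to prefer logit or probit over the LPM is that OLS fitted values are not restricted to the unit interval. A minimal sketch on synthetic data (the data-generating process here is hypothetical, for illustration only):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({'x': rng.normal(size=500)})
df['y'] = (1.2 * df['x'] + rng.normal(size=500) > 0).astype(int)

# Fit a linear probability model by OLS
lpm = smf.ols('y ~ x', data=df).fit()
fitted = lpm.predict(df)

# Some fitted "probabilities" fall outside [0, 1]
print(((fitted < 0) | (fitted > 1)).sum())
```

The logit and probit models avoid this by construction, since their predictions pass through a CDF.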
model012 = smf.logit('inlf ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6', data = mroz).fit()
print(model012.summary2().tables[1].iloc[:,:4])
print(logit_marg:= model012.get_margeff().summary())
print(model012.prsquared)
Optimization terminated successfully.
         Current function value: 0.533553
         Iterations 6
              Coef.  Std.Err.         z         P>|z|
Intercept  0.425452  0.860370  0.494499  6.209535e-01
nwifeinc  -0.021345  0.008421 -2.534620  1.125693e-02
educ       0.221170  0.043440  5.091442  3.553503e-07
exper      0.205870  0.032057  6.422001  1.344946e-10
expersq   -0.003154  0.001016 -3.104093  1.908635e-03
age       -0.088024  0.014573 -6.040232  1.538930e-09
kidslt6   -1.443354  0.203585 -7.089692  1.344105e-12
kidsge6    0.060112  0.074790  0.803749  4.215417e-01
        Logit Marginal Effects       
=====================================
Dep. Variable:                   inlf
Method:                          dydx
At:                           overall
==============================================================================
                dy/dx    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
nwifeinc      -0.0038      0.001     -2.571      0.010      -0.007      -0.001
educ           0.0395      0.007      5.414      0.000       0.025       0.054
exper          0.0368      0.005      7.139      0.000       0.027       0.047
expersq       -0.0006      0.000     -3.176      0.001      -0.001      -0.000
age           -0.0157      0.002     -6.603      0.000      -0.020      -0.011
kidslt6       -0.2578      0.032     -8.070      0.000      -0.320      -0.195
kidsge6        0.0107      0.013      0.805      0.421      -0.015       0.037
==============================================================================
0.21968137484058814
model013 = smf.probit('inlf ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6', data = mroz).fit()
print(model013.summary2().tables[1].iloc[:,:4])
print(probit_marg:= model013.get_margeff().summary())
print(model013.prsquared)
Optimization terminated successfully.
         Current function value: 0.532938
         Iterations 5
              Coef.  Std.Err.         z         P>|z|
Intercept  0.270077  0.508593  0.531027  5.953999e-01
nwifeinc  -0.012024  0.004840 -2.484327  1.297967e-02
educ       0.130905  0.025254  5.183485  2.177783e-07
exper      0.123348  0.018716  6.590348  4.387977e-11
expersq   -0.001887  0.000600 -3.145205  1.659704e-03
age       -0.052853  0.008477 -6.234656  4.527723e-10
kidslt6   -0.868329  0.118522 -7.326287  2.366161e-13
kidsge6    0.036005  0.043477  0.828142  4.075900e-01
       Probit Marginal Effects       
=====================================
Dep. Variable:                   inlf
Method:                          dydx
At:                           overall
==============================================================================
                dy/dx    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
nwifeinc      -0.0036      0.001     -2.509      0.012      -0.006      -0.001
educ           0.0394      0.007      5.452      0.000       0.025       0.054
exper          0.0371      0.005      7.200      0.000       0.027       0.047
expersq       -0.0006      0.000     -3.205      0.001      -0.001      -0.000
age           -0.0159      0.002     -6.739      0.000      -0.021      -0.011
kidslt6       -0.2612      0.032     -8.197      0.000      -0.324      -0.199
kidsge6        0.0108      0.013      0.829      0.407      -0.015       0.036
==============================================================================
0.2205805437252938
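The pseudo R-squared reported by `prsquared` is McFadden's measure, 1 − llf/llnull, where llnull is the log-likelihood of an intercept-only model. A sketch verifying this on synthetic data (hypothetical, for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(size=500)})
df['y'] = (df['x'] + rng.normal(size=500) > 0).astype(int)

m = smf.logit('y ~ x', data=df).fit(disp=0)

# McFadden's pseudo R-squared: 1 - llf / llnull
pr2_manual = 1 - m.llf / m.llnull
print(pr2_manual, m.prsquared)  # the two values coincide
```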

To get the percentage of observations predicted correctly:

y = mroz['inlf']

acc_lpm = accuracy_score(y, (model01.predict(mroz) >= 0.5).astype(int))
acc_logit = accuracy_score(y, (model012.predict(mroz) >= 0.5).astype(int))
acc_probit = accuracy_score(y, (model013.predict(mroz) >= 0.5).astype(int))

acc_lpm, acc_logit, acc_probit
(0.7343957503320053, 0.7357237715803453, 0.7343957503320053)
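Overall accuracy can mask very different correct-classification rates for the two outcomes, so it is worth inspecting the confusion matrix as well. A sketch with a synthetic, imbalanced outcome (hypothetical data, for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(1)
y_true = (rng.random(1000) < 0.7).astype(int)  # roughly 70% ones
y_pred = np.ones_like(y_true)                  # a "model" that always predicts 1

# High overall accuracy, yet every 0 is misclassified
print(accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```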
# Generate table
results_table = summary_col(
    [model01, model012, model013],
    stars=True,
    float_format="%.3f",
    model_names=["LPM (OLS)", "Logit (MLE)", "Probit (MLE)"],
    info_dict={
        "Fit (R² or Pseudo R²)": lambda x: f"{x.rsquared:.3f}" if hasattr(x, "rsquared") else f"{x.prsquared:.3f}",
        "Log-Lik": lambda x: f"{x.llf:.2f}" if hasattr(x, "llf") else "",
        "% Correct": lambda x: f"{100 * accuracy_score(y, (x.predict(mroz) >= 0.5).astype(int)):.1f}"
    }
)

# Get string output and filter out unwanted lines
table_str = results_table.as_text()
filtered_str = "\n".join(
    line for line in table_str.splitlines()
    if not re.search(r"R-squared|Adj. R-squared", line)
)

# Print cleaned-up table
print(filtered_str)

========================================================
                      LPM (OLS) Logit (MLE) Probit (MLE)
--------------------------------------------------------
Intercept             0.586***  0.425       0.270       
                      (0.154)   (0.860)     (0.509)     
nwifeinc              -0.003**  -0.021**    -0.012**    
                      (0.002)   (0.008)     (0.005)     
educ                  0.038***  0.221***    0.131***    
                      (0.007)   (0.043)     (0.025)     
exper                 0.039***  0.206***    0.123***    
                      (0.006)   (0.032)     (0.019)     
expersq               -0.001*** -0.003***   -0.002***   
                      (0.000)   (0.001)     (0.001)     
age                   -0.016*** -0.088***   -0.053***   
                      (0.002)   (0.015)     (0.008)     
kidslt6               -0.262*** -1.443***   -0.868***   
                      (0.032)   (0.204)     (0.119)     
kidsge6               0.013     0.060       0.036       
                      (0.014)   (0.075)     (0.043)     
% Correct             73.4      73.6        73.4        
Fit (R² or Pseudo R²) 0.264     0.220       0.221       
Log-Lik               -423.89   -401.77     -401.30     
========================================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01

The three models agree on which variables are statistically significant, but the coefficient magnitudes are not directly comparable across models. Compare the average marginal effects for the logit and probit models instead:

tab0 = model01.summary2().tables[1].iloc[1:,0].round(4)
tab0.name = "LPM (OLS)"

tab1 = logit_marg.tables[1]
tab1 = pd.DataFrame(tab1)
tab1 = tab1.iloc[1:,1]
tab1.index = tab0.index
tab1.name = "Logit (Marg Eff)"

tab2 = probit_marg.tables[1]
tab2 = pd.DataFrame(tab2)
tab2 = tab2.iloc[1:,1]
tab2.index = tab0.index
tab2.name = "Probit (Marg Eff)"


pd.concat([tab0, tab1, tab2], axis=1)
          LPM (OLS)  Logit (Marg Eff)  Probit (Marg Eff)
nwifeinc    -0.0034           -0.0038            -0.0036
educ         0.0380            0.0395             0.0394
exper        0.0395            0.0368             0.0371
expersq     -0.0006           -0.0006            -0.0006
age         -0.0161           -0.0157            -0.0159
kidslt6     -0.2618           -0.2578            -0.2612
kidsge6      0.0130            0.0107             0.0108

Interpreting logistic regression: coefficients and marginal effects

Coefficient of educ = 0.2212:
A one-year increase in education is associated with a 0.2212 increase in the log-odds of a woman participating in the labor force, holding the other regressors fixed.

APE of educ = 0.0395:
A one-year increase in education is associated with a 3.95 percentage point increase in the probability of labor force participation, on average.
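For a logit model, the overall APE can be recovered by hand as the coefficient times the sample mean of p̂ᵢ(1 − p̂ᵢ), the logistic density evaluated at each fitted probability. A sketch on synthetic data (the data-generating process is hypothetical, for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({'x': rng.normal(size=800)})
df['y'] = (0.5 * df['x'] + rng.logistic(size=800) > 0).astype(int)

m = smf.logit('y ~ x', data=df).fit(disp=0)
p = m.predict(df)

# APE of x: coefficient times the average of the logistic density p(1 - p)
ape_manual = m.params['x'] * (p * (1 - p)).mean()
print(ape_manual)
print(m.get_margeff().margeff)  # statsmodels reports the same number
```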

Interpreting kidslt6, LPM vs. probit:
In the LPM, having one more child under 6 is estimated to reduce the probability of labor force participation by 0.262 (26.2 percentage points), regardless of how many young children the woman already has.

kid0 = {
    'nwifeinc': 20.13,
    'educ': 12.3,
    'exper': 10.6,
    'expersq': 10.6**2,
    'age': 42.5,
    'kidsge6': 1,
    'kidslt6': 0
}
kid1 = kid0.copy()
kid1['kidslt6'] = 1

kid2 = kid0.copy()
kid2['kidslt6'] = 2
probs = model013.predict(pd.DataFrame([kid0, kid1, kid2]))

prob_lost1 = probs.iloc[1] - probs.iloc[0]
prob_lost2 = probs.iloc[2] - probs.iloc[1] 

probs, prob_lost1, prob_lost2
(0    0.699647
 1    0.365069
 2    0.112513
 dtype: float64,
 np.float64(-0.33457838066203827),
 np.float64(-0.252555628152638))

The probit model takes the person's actual covariate values into account. So for a woman with nwifeinc = 20.13, educ = 12.3, exper = 10.6, age = 42.5, and kidsge6 = 1, going from 0 to 1 small child lowers the predicted probability of labor force participation by 0.335.
If she goes from 1 small child to 2, the probability decreases by another 0.253.
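These predicted probabilities are simply Φ(x′β̂), the standard normal CDF evaluated at each woman's linear index. A sketch verifying this by hand on synthetic data (hypothetical, for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import norm

rng = np.random.default_rng(3)
df = pd.DataFrame({'x': rng.normal(size=600)})
df['y'] = (0.8 * df['x'] + rng.normal(size=600) > 0).astype(int)

m = smf.probit('y ~ x', data=df).fit(disp=0)

# Predicted probability = Phi(x'beta), the standard normal CDF
xb = m.params['Intercept'] + m.params['x'] * df['x']
print(np.allclose(norm.cdf(xb), m.predict(df)))  # True
```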

17.2 Married Women’s Annual Labor Supply