Multiple Regression Analysis: OLS Asymptotics - Introductory Econometrics with Python

import pandas as pd
import numpy as np
import wooldridge
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from scipy.stats import chi2

def lm_test(data, dependent, control_vars, test_vars):
    # Step 1: Fit restricted model
    restricted_formula = f"{dependent} ~ {' + '.join(control_vars)}"
    restricted_model = smf.ols(restricted_formula, data=data).fit()

    # Step 2: Save residuals
    data = data.copy()
    data["residuals"] = restricted_model.resid

    # Step 3: Fit auxiliary regression (residuals on all vars)
    full_vars = control_vars + test_vars
    auxiliary_formula = f"residuals ~ {' + '.join(full_vars)}"
    auxiliary_model = smf.ols(auxiliary_formula, data=data).fit()

    # Step 4: Compute LM statistic and p-value
    r_squared = auxiliary_model.rsquared
    n = len(data)
    q = len(test_vars)
    lm_stat = r_squared * n
    pval = 1 - chi2.cdf(lm_stat, df=q)

    return lm_stat, pval

Examples¶

bwght = wooldridge.data("bwght")
crime1 = wooldridge.data("crime1")

5.1 Housing Prices and Distances From an Incinerator¶

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u

(1)

$y$ : house price, $x_1$ : distance from incinerator, $x_2$ : quality of house

we expect $\beta_1 >0$ and $\beta_2 > 0$ so if $x_2$ is omitted and positively correlated with $x_1$ , OLS estimators are inconsistent

5.2 Standard Errors in a Birth Weight Equation¶

wooldridge.data("bwght", description=True)

name of dataset: bwght
no of variables: 14
no of observations: 1388

+----------+--------------------------------+
| variable | label                          |
+----------+--------------------------------+
| faminc   | 1988 family income, $1000s     |
| cigtax   | cig. tax in home state, 1988   |
| cigprice | cig. price in home state, 1988 |
| bwght    | birth weight, ounces           |
| fatheduc | father's yrs of educ           |
| motheduc | mother's yrs of educ           |
| parity   | birth order of child           |
| male     | =1 if male child               |
| white    | =1 if white                    |
| cigs     | cigs smked per day while preg  |
| lbwght   | log of bwght                   |
| bwghtlbs | birth weight, pounds           |
| packs    | packs smked per day while preg |
| lfaminc  | log(faminc)                    |
+----------+--------------------------------+

J. Mullahy (1997), “Instrumental-Variable Estimation of Count Data
Models: Applications to Models of Cigarette Smoking Behavior,” Review
of Economics and Statistics 79, 596-593. Professor Mullahy kindly
provided the data. He obtained them from the 1988 National Health
Interview Survey.

model02 = smf.ols("lbwght ~ cigs + lfaminc", data=bwght).fit()
model02.summary2().tables[1].iloc[:, :2]

df = bwght.iloc[: len(bwght) // 2]
model021 = smf.ols("lbwght ~ cigs + lfaminc", data=df).fit()
model021.summary2().tables[1].iloc[:, :2]

len(bwght), len(df)

(1388, 694)

Using half the dataset, the standard error for cigs is 0.0013 but becomes 0.00086 if we use the whole dataset. Ratio of standard errors is $0.00086/ 0.0013 = 0.662$ which is equal to the approcimation $\sqrt{694/1388} = 0.707$

\dfrac{SE_{new}}{SE_{old}} = \sqrt{\dfrac{n_{old}}{n_{new}}}

(2)

5.3 Economic Model of Crime¶

wooldridge.data("crime1", description=True)

name of dataset: crime1
no of variables: 16
no of observations: 2725

+----------+---------------------------------+
| variable | label                           |
+----------+---------------------------------+
| narr86   | # times arrested, 1986          |
| nfarr86  | # felony arrests, 1986          |
| nparr86  | # property crme arr., 1986      |
| pcnv     | proportion of prior convictions |
| avgsen   | avg sentence length, mos.       |
| tottime  | time in prison since 18 (mos.)  |
| ptime86  | mos. in prison during 1986      |
| qemp86   | # quarters employed, 1986       |
| inc86    | legal income, 1986, $100s       |
| durat    | recent unemp duration           |
| black    | =1 if black                     |
| hispan   | =1 if Hispanic                  |
| born60   | =1 if born in 1960              |
| pcnvsq   | pcnv^2                          |
| pt86sq   | ptime86^2                       |
| inc86sq  | inc86^2                         |
+----------+---------------------------------+

J. Grogger (1991), “Certainty vs. Severity of Punishment,” Economic
Inquiry 29, 297-309. Professor Grogger kindly provided a subset of the
data he used in his article.

model03 = smf.ols(
    "narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86", data=crime1
).fit()
model03.summary2().tables[1].iloc[:, :2]

To test that $avgsen, tottime$ have no effect using LM statistic

model032 = smf.ols("narr86 ~ pcnv  + ptime86 + qemp86", data=crime1).fit()
residuals = model032.resid

df = crime1.copy()
df["residuals"] = residuals
model033 = smf.ols(
    "residuals ~ pcnv  + ptime86 + qemp86 + " "avgsen + tottime", data=df
).fit()
rsquared = model033.rsquared

LM = rsquared * len(df)
LM

np.float64(4.070729461071595)

The 10% critical value in a chi square distribution with 2 degrees of freedom is 4.61, so we fail to reject the null hypothesis

pval = 1 - chi2.cdf(LM, df=2)
pval

np.float64(0.13063282803267184)

Using my function

lm_test(
    crime1,
    dependent="narr86",
    control_vars=["pcnv", "ptime86", " qemp86"],
    test_vars=["avgsen", "tottime"],
)

(np.float64(4.070729461071595), np.float64(0.13063282803267184))

🚧Computer Exercises¶

Introductory Econometrics with Python

Multiple Regression Analysis: Inference

Introductory Econometrics with Python

Multiple Regression Analysis: Further Issues