import pandas as pd
import numpy as np
import wooldridge
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

def duan_smearing_estimator(model):
"""
Compute Duan's smearing estimate (sigma0) from a fitted log-linear model.
Parameters:
model: A fitted statsmodels OLS model where the dependent variable is log(y).
Returns:
float: Smearing estimate sigma0.
"""
resid = model.resid # residuals from log-linear model
sigma0 = np.mean(np.exp(resid)) # Duan's smearing estimate
return sigma0wooldridge.data() J.M. Wooldridge (2019) Introductory Econometrics: A Modern Approach,
Cengage Learning, 6th edition.
401k 401ksubs admnrev affairs airfare
alcohol apple approval athlet1 athlet2
attend audit barium beauty benefits
beveridge big9salary bwght bwght2 campus
card catholic cement census2000 ceosal1
ceosal2 charity consump corn countymurders
cps78_85 cps91 crime1 crime2 crime3
crime4 discrim driving earns econmath
elem94_95 engin expendshares ezanders ezunem
fair fertil1 fertil2 fertil3 fish
fringe gpa1 gpa2 gpa3 happiness
hprice1 hprice2 hprice3 hseinv htv
infmrt injury intdef intqrt inven
jtrain jtrain2 jtrain3 kielmc lawsch85
loanapp lowbrth mathpnl meap00_01 meap01
meap93 meapsingle minwage mlb1 mroz
murder nbasal nyse okun openness
pension phillips pntsprd prison prminwge
rdchem rdtelec recid rental return
saving sleep75 slp75_81 smoke traffic1
traffic2 twoyear volat vote1 vote2
voucher wage1 wage2 wagepan wageprc
wine
Examples¶
pd.options.display.float_format = "{:.4f}".format

hprice2 = wooldridge.data("hprice2")
attend = wooldridge.data("attend")
ceosal1 = wooldridge.data("ceosal1")
gpa2 = wooldridge.data("gpa2")
ceosal2 = wooldridge.data("ceosal2")

6.1 Effects of Pollution on Housing Prices¶
To obtain beta (standardized) coefficients, I use StandardScaler from the sklearn library.
wooldridge.data("hprice2", description=True)name of dataset: hprice2
no of variables: 12
no of observations: 506
+----------+-------------------------------+
| variable | label |
+----------+-------------------------------+
| price | median housing price, $ |
| crime | crimes committed per capita |
| nox | nit ox concen; parts per 100m |
| rooms | avg number of rooms |
| dist | wght dist to 5 employ centers |
| radial | access. index to rad. hghwys |
| proptax | property tax per $1000 |
| stratio | average student-teacher ratio |
| lowstat | perc of people 'lower status' |
| lprice | log(price) |
| lnox | log(nox) |
| lproptax | log(proptax) |
+----------+-------------------------------+
D. Harrison and D.L. Rubinfeld (1978), “Hedonic Housing Prices and the
Demand for Clean Air,” Journal of Environmental Economics and Management
5, 81-102. Diego Garcia, a former Ph.D. student in economics at MIT,
kindly provided these data, which he obtained from the book Regression
Diagnostics: Identifying Influential Data and Sources of Collinearity,
by D.A. Belsey, E. Kuh, and R. Welsch, 1990. New York: Wiley.
hprice2.head()

df = hprice2.copy()
vars_to_standardize = ["price", "nox", "crime", "rooms", "dist", "stratio"]
df[vars_to_standardize] = StandardScaler().fit_transform(df[vars_to_standardize])

df.head()

pd.set_option("display.float_format", "{:.3f}".format)
# pd.reset_option('display.float_format')

model01 = smf.ols("price ~ nox + crime + rooms + dist + stratio", data=df).fit()
model01.summary2().tables[1].iloc[:, :2]

A one-standard-deviation increase in nox decreases price by 0.34 standard deviations.
A one-standard-deviation increase in crime decreases price by 0.14 standard deviations.
So nox affects price more than crime.
The t statistics do not change when the data are standardized.
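A quick check of that last claim, on synthetic data rather than the hprice2 sample: standardization is just a linear rescaling of y and the regressors, so the slope t statistics are unchanged. (The variable names `x1`, `x2`, `y` here are made up for the sketch.)

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1 + 2 * df["x1"] - 0.5 * df["x2"] + rng.normal(size=n)

# fit on the raw data
raw = smf.ols("y ~ x1 + x2", data=df).fit()

# z-score every column by hand (population std, like StandardScaler's default)
z = (df - df.mean()) / df.std(ddof=0)
std = smf.ols("y ~ x1 + x2", data=z).fit()

# the slope t statistics agree to machine precision
print(raw.tvalues[["x1", "x2"]])
print(std.tvalues[["x1", "x2"]])
```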
6.2 Effects of Pollution on Housing Prices¶
model02 = smf.ols(
"lprice ~ lnox + np.log(dist) + rooms + I(rooms**2) + stratio", data=hprice2
).fit()
model02.summary2().tables[1].iloc[:, :4]

The quadratic term is statistically significant. The coefficient on rooms is negative and the coefficient on rooms squared is positive.
Rooms have a negative effect on price up to some point; after that point the effect becomes positive. This turnaround point is |beta_rooms| / (2 * beta_rooms_squared).
That price decreases with rooms at three rooms or fewer is not practically meaningful; the negative range reflects the few observations in the sample with so few rooms.
Note: the effect of rooms depends on the number of rooms, so examining the coefficients alone is not enough.
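The turnaround formula can be sketched directly from a fitted quadratic. This uses synthetic data shaped like the rooms example (negative linear term, positive squared term), not the actual hprice2 coefficients:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"rooms": rng.uniform(3, 9, size=500)})
# true turnaround point: 0.5 / (2 * 0.06) ≈ 4.17
df["lprice"] = (10 - 0.5 * df["rooms"] + 0.06 * df["rooms"] ** 2
                + rng.normal(scale=0.05, size=500))

fit = smf.ols("lprice ~ rooms + I(rooms**2)", data=df).fit()
b1 = fit.params["rooms"]
b2 = fit.params["I(rooms ** 2)"]  # patsy names the quadratic term this way

turning_point = -b1 / (2 * b2)  # x* = -b1 / (2*b2)
print(round(turning_point, 2))
```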
6.3 Effects of Attendance on Final Exam Performance¶
wooldridge.data("attend", description=True)name of dataset: attend
no of variables: 11
no of observations: 680
+----------+------------------------------+
| variable | label |
+----------+------------------------------+
| attend | classes attended out of 32 |
| termGPA | GPA for term |
| priGPA | cumulative GPA prior to term |
| ACT | ACT score |
| final | final exam score |
| atndrte | percent classes attended |
| hwrte | percent homework turned in |
| frosh | =1 if freshman |
| soph | =1 if sophomore |
| missed | number of classes missed |
| stndfnl | (final - mean)/sd |
+----------+------------------------------+
These data were collected by Professors Ronald Fisher and Carl
Liedholm during a term in which they both taught principles of
microeconomics at Michigan State University. Professors Fisher and
Liedholm kindly gave me permission to use a random subset of their
data, and their research assistant at the time, Jeffrey Guilfoyle, who
completed his Ph.D. in economics at MSU, provided helpful hints.
model03 = smf.ols(
"stndfnl ~ atndrte + priGPA + ACT + " "I(priGPA**2)+ I(ACT**2)+ I(priGPA*atndrte)",
data=attend,
).fit()
model03.summary2().tables[1].iloc[:, :4]

When priGPA is small enough, the estimated effect of attendance on exam score is negative.
The individual t statistics are misleading here. Check the F test:
model03.params.index

Index(['Intercept', 'atndrte', 'priGPA', 'ACT', 'I(priGPA ** 2)',
       'I(ACT ** 2)', 'I(priGPA * atndrte)'],
      dtype='object')

model03.f_test("atndrte = 0, I(priGPA * atndrte) = 0")

<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=4.319021763197198, p=0.013683848684002065, df_denom=673, df_num=2>

To estimate the partial effect of atndrte, plug a value for priGPA into the interaction term:
attend["priGPA"].mean()

np.float64(2.586774999078582)

At the mean priGPA of about 2.59, the effect of atndrte is roughly -0.0067 + 0.0056 * 2.59 ≈ 0.0078.
If the attendance rate increases by 10 percentage points, the final exam score increases by 0.078 standard deviations.
To test whether the 0.0078 is significantly different from zero, re-center priGPA inside the interaction term and re-estimate:
model032 = smf.ols(
"stndfnl ~ atndrte + priGPA + ACT + "
"I(priGPA**2)+ I(ACT**2)+ I((priGPA-2.59)*atndrte)",
data=attend,
).fit()
model032.summary2().tables[1].iloc[:, :4]

With this re-centering, the coefficient on atndrte becomes our calculated 0.0078, with a p value of 0.0034: statistically significant.
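An equivalent route, sketched here on synthetic data (the coefficients and the 2.59 below are illustrative, not the attend estimates): instead of re-centering and re-estimating, test the linear combination b_atndrte + 2.59 * b_interaction directly with t_test, which also returns its standard error.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 400
df = pd.DataFrame({"atndrte": rng.uniform(20, 100, n),
                   "priGPA": rng.uniform(1, 4, n)})
# true partial effect of atndrte at priGPA = 2.59: -0.007 + 0.006*2.59 ≈ 0.0085
df["stndfnl"] = (-0.007 * df["atndrte"]
                 + 0.006 * df["priGPA"] * df["atndrte"]
                 + rng.normal(size=n))

fit = smf.ols("stndfnl ~ atndrte + priGPA + I(priGPA * atndrte)", data=df).fit()

# test the linear combination at priGPA = 2.59 in one call
tt = fit.t_test("atndrte + 2.59 * I(priGPA * atndrte) = 0")
print(tt)
```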
6.4 CEO Compensation and Firm Performance¶
wooldridge.data("ceosal1", description=True)name of dataset: ceosal1
no of variables: 12
no of observations: 209
+----------+-------------------------------+
| variable | label |
+----------+-------------------------------+
| salary | 1990 salary, thousands $ |
| pcsalary | % change salary, 89-90 |
| sales | 1990 firm sales, millions $ |
| roe | return on equity, 88-90 avg |
| pcroe | % change roe, 88-90 |
| ros | return on firm's stock, 88-90 |
| indus | =1 if industrial firm |
| finance | =1 if financial firm |
| consprod | =1 if consumer product firm |
| utility  | =1 if transport. or utilities |
| lsalary | natural log of salary |
| lsales | natural log of sales |
+----------+-------------------------------+
I took a random sample of data reported in the May 6, 1991 issue of
Businessweek.
model04 = smf.ols("salary ~ sales + roe", data=ceosal1).fit()
model04.rsquared, model04.rsquared_adj

(np.float64(0.029171686607148084), np.float64(0.019746169001392255))

model042 = smf.ols("lsalary ~ lsales + roe", data=ceosal1).fit()
model042.rsquared, model042.rsquared_adj

(np.float64(0.28198874478633673), np.float64(0.2750177617260099))

The second model has a higher R-squared and adjusted R-squared because the variation in the dependent variable is lower after taking the log:
y = model04.model.endog
y_bar = np.mean(y)
TSS = np.sum((y - y_bar) ** 2)

y = model042.model.endog
y_bar = np.mean(y)
TSS2 = np.sum((y - y_bar) ** 2)

TSS, TSS2

(np.float64(391732982.00956935), np.float64(66.72216321299874))

6.5 Confidence Interval for Predicted College GPA¶
wooldridge.data("gpa2", description=True)name of dataset: gpa2
no of variables: 12
no of observations: 4137
+----------+----------------------------------+
| variable | label |
+----------+----------------------------------+
| sat | combined SAT score |
| tothrs | total hours through fall semest |
| colgpa | GPA after fall semester |
| athlete | =1 if athlete |
| verbmath | verbal/math SAT score |
| hsize | size grad. class, 100s |
| hsrank | rank in grad. class |
| hsperc | high school percentile, from top |
| female | =1 if female |
| white | =1 if white |
| black | =1 if black |
| hsizesq | hsize^2 |
+----------+----------------------------------+
For confidentiality reasons, I cannot provide the source of these
data. I can say that they come from a midsize research university
that also supports men’s and women’s athletics at the Division I
level.
model05 = smf.ols("colgpa ~ sat + hsperc + hsize + hsizesq", data=gpa2).fit()
model05.summary2().tables[1].iloc[:, :2]

model05.summary2().tables[0]

se = model05.mse_resid**0.5
se

np.float64(0.5598638444189318)

Use the model to predict GPA at chosen values of the independent variables:
new_data = pd.DataFrame({"sat": [1200], "hsperc": [30], "hsize": [5], "hsizesq": [25]})

predicted_gpa = model05.predict(new_data)
predicted_gpa

0   2.700
dtype: float64

To calculate a confidence interval for our predicted 2.7, subtract the values used for the prediction from the independent variables:
df = gpa2.copy()
df["sat"] = gpa2["sat"] - 1200
df["hsize"] = gpa2["hsize"] - 5
df["hsperc"] = gpa2["hsperc"] - 30
df["hsizesq"] = gpa2["hsizesq"] - 25model052 = smf.ols("colgpa ~ sat + hsperc + hsize + hsizesq", data=df).fit()
model052.summary2().tables[1].iloc[:]The model resulted in same coefficients as before except for the intercept value. The confidence interval for the prediction is [2.66,2.74]
from statsmodels.iolib.summary2 import summary_col
summary = summary_col(
[model05, model052],
stars=True,
model_names=["Uncentered", "Centered"],
info_dict={
"R-squared": lambda x: f"{x.rsquared:.4f}",
"No. observations": lambda x: f"{int(x.nobs)}",
},
)
print(summary)
======================================
Uncentered Centered
--------------------------------------
Intercept 1.4927*** 2.7001***
(0.0753) (0.0199)
sat 0.0015*** 0.0015***
(0.0001) (0.0001)
hsperc -0.0139*** -0.0139***
(0.0006) (0.0006)
hsize -0.0609*** -0.0609***
(0.0165) (0.0165)
hsizesq 0.0055** 0.0055**
(0.0023) (0.0023)
R-squared 0.2781 0.2781
R-squared Adj. 0.2774 0.2774
No. observations 4137 4137
R-squared 0.2781 0.2781
======================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
Note: centering hsize and then squaring, (hsize - 5)**2, gives a different value than squaring and then centering, hsize**2 - 25.
The correct order here is to square first and then center, which matches how hsizesq enters the model.
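A quick numeric check of that note, with a made-up toy array (not the gpa2 data):

```python
import numpy as np

hsize = np.array([2.0, 5.0, 8.0])
center_then_square = (hsize - 5) ** 2   # squares the centered variable
square_then_center = hsize ** 2 - 25    # centers the squared variable (what the example does)
print(center_then_square)   # [9. 0. 9.]
print(square_then_center)   # [-21.  0.  39.]
```

The two disagree everywhere except at hsize = 5, so only the square-then-center version leaves the slope coefficients untouched.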
6.6 Confidence Interval for Future College GPA¶
To get prediction confidence intervals
model06 = smf.ols("colgpa ~ sat + hsperc + hsize + hsizesq", data=gpa2).fit()
model06.summary2().tables[1].iloc[:, :2]

new_data = pd.DataFrame({"sat": [1200], "hsperc": [30], "hsize": [5], "hsizesq": [25]})

pred = model06.get_prediction(new_data)
pred.summary_frame()

The confidence interval for the prediction is [1.6, 3.8]. It is wider because it accounts for individual-specific (residual) variation due to unobserved factors.
Note: Example 6.5 computed a confidence interval for the average college GPA of students with the given characteristics.
Example 6.6 added the residual variation to get an interval for an individual student's GPA.
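The distinction shows up directly in summary_frame, sketched here on synthetic data (a made-up one-regressor model, not gpa2): mean_ci is the interval for the average response (Example 6.5), obs_ci the wider interval for an individual outcome (Example 6.6).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.uniform(0, 10, 300)})
df["y"] = 1 + 0.5 * df["x"] + rng.normal(scale=1.0, size=300)

fit = smf.ols("y ~ x", data=df).fit()
frame = fit.get_prediction(pd.DataFrame({"x": [5.0]})).summary_frame()

# interval for the mean response vs. interval for a single new observation
mean_width = frame["mean_ci_upper"] - frame["mean_ci_lower"]
obs_width = frame["obs_ci_upper"] - frame["obs_ci_lower"]
print(frame.round(3))
```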
6.7 Predicting CEO Salaries¶
wooldridge.data("ceosal2", description=True)name of dataset: ceosal2
no of variables: 15
no of observations: 177
+----------+--------------------------------+
| variable | label |
+----------+--------------------------------+
| salary | 1990 compensation, $1000s |
| age | in years |
| college | =1 if attended college |
| grad | =1 if attended graduate school |
| comten | years with company |
| ceoten | years as ceo with company |
| sales | 1990 firm sales, millions |
| profits | 1990 profits, millions |
| mktval | market value, end 1990, mills. |
| lsalary | log(salary) |
| lsales | log(sales) |
| lmktval | log(mktval) |
| comtensq | comten^2 |
| ceotensq | ceoten^2 |
| profmarg | profits as % of sales |
+----------+--------------------------------+
See CEOSAL1.RAW
model07 = smf.ols("lsalary ~ lsales + lmktval + ceoten", data=ceosal2).fit()
model07.summary2().tables[1].iloc[:, :2]

resid = model07.resid
n = model07.nobs

The Duan smearing estimate is the average of the exponentiated residuals:
sigma0 = np.sum(np.exp(resid)) / n
sigma0

np.float64(1.1356613266630768)

duan_smearing_estimator(model07)

np.float64(1.1356613266630768)

The prediction for lsales = log(5000), lmktval = log(10000), and ceoten = 10:
new_data = pd.DataFrame(
{"lsales": [np.log(5000)], "lmktval": [np.log(10000)], "ceoten": [10]}
)

prediction = model07.predict(new_data)
prediction

0   7.014
dtype: float64

The predicted salary is the smearing estimate times the exponentiated log prediction:
np.exp(prediction) * sigma0

0   1263.059
dtype: float64

6.8 Predicting CEO Salaries¶
model08 = smf.ols("lsalary ~ lsales + lmktval + ceoten", data=ceosal2).fit()
model08.summary2().tables[1].iloc[:, :2]

To get predicted salary from our log model, take the exponential of the fitted values.
Call the result mhat to avoid confusing it with the fitted value predicted from the level model.
mhat = np.exp(model08.predict())
df = ceosal2.copy()
df["mhat"] = mhatThen get the correlation between it and y and square it
corr = df["mhat"].corr(df["salary"])
np.square(corr)np.float64(0.2430807795867979)The log model explains 24.3% of the variation in
model082 = smf.ols("salary ~ sales + mktval + ceoten", data=ceosal2).fit()
model082.rsquarednp.float64(0.20127439139676495)The level model explains of the variation in while the log model explains of the variation in The log model is preferred to the level model