
Multiple Regression Analysis: Further Issues

import pandas as pd
import numpy as np
import wooldridge
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
def duan_smearing_estimator(model):
    """
    Compute Duan's smearing estimate (sigma0) from a fitted log-linear model.

    Parameters:
    model: A fitted statsmodels OLS model where the dependent variable is log(y).

    Returns:
    float: Smearing estimate sigma0.
    """
    resid = model.resid  # residuals from log-linear model
    sigma0 = np.mean(np.exp(resid))  # Duan's smearing estimate
    return sigma0
wooldridge.data()
  J.M. Wooldridge (2019) Introductory Econometrics: A Modern Approach,
  Cengage Learning, 6th edition.

  401k       401ksubs    admnrev       affairs     airfare
  alcohol    apple       approval      athlet1     athlet2
  attend     audit       barium        beauty      benefits
  beveridge  big9salary  bwght         bwght2      campus
  card       catholic    cement        census2000  ceosal1
  ceosal2    charity     consump       corn        countymurders
  cps78_85   cps91       crime1        crime2      crime3
  crime4     discrim     driving       earns       econmath
  elem94_95  engin       expendshares  ezanders    ezunem
  fair       fertil1     fertil2       fertil3     fish
  fringe     gpa1        gpa2          gpa3        happiness
  hprice1    hprice2     hprice3       hseinv      htv
  infmrt     injury      intdef        intqrt      inven
  jtrain     jtrain2     jtrain3       kielmc      lawsch85
  loanapp    lowbrth     mathpnl       meap00_01   meap01
  meap93     meapsingle  minwage       mlb1        mroz
  murder     nbasal      nyse          okun        openness
  pension    phillips    pntsprd       prison      prminwge
  rdchem     rdtelec     recid         rental      return
  saving     sleep75     slp75_81      smoke       traffic1
  traffic2   twoyear     volat         vote1       vote2
  voucher    wage1       wage2         wagepan     wageprc
  wine

Examples

pd.options.display.float_format = "{:.4f}".format
hprice2 = wooldridge.data("hprice2")
attend = wooldridge.data("attend")
ceosal1 = wooldridge.data("ceosal1")
gpa2 = wooldridge.data("gpa2")
ceosal2 = wooldridge.data("ceosal2")

6.1 Effects of Pollution on Housing Prices

To compute beta (standardized) coefficients, I use StandardScaler from the sklearn library.

wooldridge.data("hprice2", description=True)
name of dataset: hprice2
no of variables: 12
no of observations: 506

+----------+-------------------------------+
| variable | label                         |
+----------+-------------------------------+
| price    | median housing price, $       |
| crime    | crimes committed per capita   |
| nox      | nit ox concen; parts per 100m |
| rooms    | avg number of rooms           |
| dist     | wght dist to 5 employ centers |
| radial   | access. index to rad. hghwys  |
| proptax  | property tax per $1000        |
| stratio  | average student-teacher ratio |
| lowstat  | perc of people 'lower status' |
| lprice   | log(price)                    |
| lnox     | log(nox)                      |
| lproptax | log(proptax)                  |
+----------+-------------------------------+

D. Harrison and D.L. Rubinfeld (1978), “Hedonic Housing Prices and the
Demand for Clean Air,” Journal of Environmental Economics and
Management 5, 81-102. Diego Garcia, a former Ph.D. student in
economics at MIT, kindly provided these data, which he obtained from
the book Regression Diagnostics: Identifying Influential Data and
Sources of Collinearity, by D.A. Belsley, E. Kuh, and R. Welsch, 1990.
New York: Wiley.
hprice2.head()
df = hprice2.copy()
vars_to_standardize = ["price", "nox", "crime", "rooms", "dist", "stratio"]
df[vars_to_standardize] = StandardScaler().fit_transform(df[vars_to_standardize])
df.head()
pd.set_option("display.float_format", "{:.3f}".format)
# pd.reset_option('display.float_format')
model01 = smf.ols("price ~ nox + crime + rooms + dist + stratio", data=df).fit()
model01.summary2().tables[1].iloc[:, :2]

A one-standard-deviation increase in $nox$ decreases $price$ by 0.34 standard deviations.
A one-standard-deviation increase in $crime$ decreases $price$ by 0.14 standard deviations.
So $nox$ affects $price$ more than $crime$.
The t statistics do not change when the data are standardized.

6.2 Effects of Pollution on Housing Prices

model02 = smf.ols(
    "lprice ~ lnox + np.log(dist) + rooms + I(rooms**2) + stratio", data=hprice2
).fit()
model02.summary2().tables[1].iloc[:, :4]

$rooms^2$ is statistically significant. The coefficient on $rooms$ is negative and the coefficient on $rooms^2$ is positive:
rooms has a negative effect on price up to some point, after which the effect becomes positive. To find that point, compute

$0.545/[2(0.062)] \approx 4.4$

A price that decreases in rooms for houses with about three rooms or fewer is not practically important; the negative range arises because the sample contains few observations with that few rooms.

The effect of adding one extra room is

$\%\Delta price \approx 100[-0.545 + 2(0.062)rooms]\Delta rooms$

which is equal to

$\%\Delta price = [-54.5 + 12.4\,rooms]\Delta rooms$

Note: the effect of $rooms$ depends on the number of rooms, so examining the coefficient alone is not enough.
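The turning point and the marginal-effect formula can be verified with the reported (rounded) coefficients:

```python
# Coefficients on rooms and rooms^2 as reported above (rounded)
b_rooms, b_rooms2 = -0.545, 0.062

# Turning point: the effect of rooms switches sign at 0.545 / (2 * 0.062)
turning_point = -b_rooms / (2 * b_rooms2)
print(round(turning_point, 1))  # 4.4

# Approximate % change in price from one extra room, at a given rooms value
def room_effect(rooms):
    return 100 * (b_rooms + 2 * b_rooms2 * rooms)

print(round(room_effect(5), 1))  # 7.5: positive above the turning point
print(round(room_effect(3), 1))  # -17.3: negative below it
```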

6.3 Effects of Attendance on Final Exam Performance

wooldridge.data("attend", description=True)
name of dataset: attend
no of variables: 11
no of observations: 680

+----------+------------------------------+
| variable | label                        |
+----------+------------------------------+
| attend   | classes attended out of 32   |
| termGPA  | GPA for term                 |
| priGPA   | cumulative GPA prior to term |
| ACT      | ACT score                    |
| final    | final exam score             |
| atndrte  | percent classes attended     |
| hwrte    | percent homework turned in   |
| frosh    | =1 if freshman               |
| soph     | =1 if sophomore              |
| missed   | number of classes missed     |
| stndfnl  | (final - mean)/sd            |
+----------+------------------------------+

These data were collected by Professors Ronald Fisher and Carl
Liedholm during a term in which they both taught principles of
microeconomics at Michigan State University. Professors Fisher and
Liedholm kindly gave me permission to use a random subset of their
data, and their research assistant at the time, Jeffrey Guilfoyle, who
completed his Ph.D. in economics at MSU, provided helpful hints.
model03 = smf.ols(
    "stndfnl ~ atndrte + priGPA + ACT + " "I(priGPA**2)+ I(ACT**2)+ I(priGPA*atndrte)",
    data=attend,
).fit()
model03.summary2().tables[1].iloc[:, :4]

When $priGPA = 0$, attendance has a negative effect on the exam score.
The t statistic on $atndrte$ alone is misleading here; check the joint F test.

model03.params.index
Index(['Intercept', 'atndrte', 'priGPA', 'ACT', 'I(priGPA ** 2)', 'I(ACT ** 2)', 'I(priGPA * atndrte)'], dtype='object')
model03.f_test("atndrte = 0, I(priGPA * atndrte) = 0")
<class 'statsmodels.stats.contrast.ContrastResults'> <F test: F=4.319021763197198, p=0.013683848684002065, df_denom=673, df_num=2>

To estimate the partial effect of $atndrte$, plug a value of $priGPA$ into the interaction term.

attend["priGPA"].mean()
np.float64(2.586774999078582)

The effect of $atndrte$ at the mean $priGPA$ is

$-0.0067 + 0.0056(2.59) \approx 0.0078$
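As a quick check of the arithmetic (coefficients rounded as above):

```python
# atndrte and priGPA*atndrte coefficients (rounded), mean priGPA
b_atndrte, b_inter = -0.0067, 0.0056
prigpa_mean = 2.59

partial_effect = b_atndrte + b_inter * prigpa_mean
print(round(partial_effect, 4))  # 0.0078
```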


If the attendance rate increases by 10 percentage points, the final exam score increases by 0.078 standard deviations.
To test whether this 0.0078 is significantly different from zero, re-center $priGPA$ at 2.59 inside the interaction:

model032 = smf.ols(
    "stndfnl ~ atndrte + priGPA + ACT + "
    "I(priGPA**2)+ I(ACT**2)+ I((priGPA-2.59)*atndrte)",
    data=attend,
).fit()
model032.summary2().tables[1].iloc[:, :4]

With this re-centering, the coefficient on $atndrte$ becomes exactly our calculated 0.0078, with a p value of 0.0034: statistically significant.

6.4 CEO Compensation and Firm Performance

wooldridge.data("ceosal1", description=True)
name of dataset: ceosal1
no of variables: 12
no of observations: 209

+----------+-------------------------------+
| variable | label                         |
+----------+-------------------------------+
| salary   | 1990 salary, thousands $      |
| pcsalary | % change salary, 89-90        |
| sales    | 1990 firm sales, millions $   |
| roe      | return on equity, 88-90 avg   |
| pcroe    | % change roe, 88-90           |
| ros      | return on firm's stock, 88-90 |
| indus    | =1 if industrial firm         |
| finance  | =1 if financial firm          |
| consprod | =1 if consumer product firm   |
| utility  | =1 if transport. or utilties  |
| lsalary  | natural log of salary         |
| lsales   | natural log of sales          |
+----------+-------------------------------+

I took a random sample of data reported in the May 6, 1991 issue of
Businessweek.
model04 = smf.ols("salary ~ sales + roe", data=ceosal1).fit()
model04.rsquared, model04.rsquared_adj
(np.float64(0.029171686607148084), np.float64(0.019746169001392255))
model042 = smf.ols("lsalary ~ lsales + roe", data=ceosal1).fit()
model042.rsquared, model042.rsquared_adj
(np.float64(0.28198874478633673), np.float64(0.2750177617260099))

The second model has a higher $R^2$ and $\bar R^2$ because the variation in $lsalary$ is much lower after taking the $\log$:

y = model04.model.endog
y_bar = np.mean(y)
TSS = np.sum((y - y_bar) ** 2)
y = model042.model.endog
y_bar = np.mean(y)
TSS2 = np.sum((y - y_bar) ** 2)
TSS, TSS2
(np.float64(391732982.00956935), np.float64(66.72216321299874))

6.5 Confidence Interval for Predicted College GPA

wooldridge.data("gpa2", description=True)
name of dataset: gpa2
no of variables: 12
no of observations: 4137

+----------+----------------------------------+
| variable | label                            |
+----------+----------------------------------+
| sat      | combined SAT score               |
| tothrs   | total hours through fall semest  |
| colgpa   | GPA after fall semester          |
| athlete  | =1 if athlete                    |
| verbmath | verbal/math SAT score            |
| hsize    | size grad. class, 100s           |
| hsrank   | rank in grad. class              |
| hsperc   | high school percentile, from top |
| female   | =1 if female                     |
| white    | =1 if white                      |
| black    | =1 if black                      |
| hsizesq  | hsize^2                          |
+----------+----------------------------------+

For confidentiality reasons, I cannot provide the source of these
data. I can say that  they come from a midsize research university
that also supports men’s and women’s athletics at the Division I
level.
model05 = smf.ols("colgpa ~ sat + hsperc + hsize + hsizesq", data=gpa2).fit()
model05.summary2().tables[1].iloc[:, :2]
model05.summary2().tables[0]
se = model05.mse_resid**0.5
se
np.float64(0.5598638444189318)

Use the model to predict GPA at the values $sat = 1200$, $hsperc = 30$, $hsize = 5$:

new_data = pd.DataFrame({"sat": [1200], "hsperc": [30], "hsize": [5], "hsizesq": [25]})
predicted_gpa = model05.predict(new_data)
predicted_gpa
0 2.700 dtype: float64

To calculate a confidence interval for the predicted 2.70, subtract the prediction values from the corresponding independent variables and refit: the intercept of the re-centered model equals the prediction, and its confidence interval is the one we want.

df = gpa2.copy()
df["sat"] = gpa2["sat"] - 1200
df["hsize"] = gpa2["hsize"] - 5
df["hsperc"] = gpa2["hsperc"] - 30
df["hsizesq"] = gpa2["hsizesq"] - 25
model052 = smf.ols("colgpa ~ sat + hsperc + hsize + hsizesq", data=df).fit()
model052.summary2().tables[1].iloc[:]

The model yields the same coefficients as before except for the intercept. The confidence interval for the prediction is [2.66, 2.74].
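The centering trick can be checked against statsmodels' get_prediction on simulated data (a sketch with made-up variables, not gpa2): the intercept of the re-centered model reproduces the confidence interval for the mean prediction.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (illustrative only); x0 is the point of interest
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 1.0 + 0.5 * df["x"] + rng.normal(size=200)
x0 = 0.8

# Centering trick: the intercept of the re-centered model is the
# prediction at x0, and its CI is the CI for the mean prediction
df["xc"] = df["x"] - x0
fit_c = smf.ols("y ~ xc", data=df).fit()
ci_center = fit_c.conf_int().loc["Intercept"].values

# Same interval directly from get_prediction on the uncentered model
fit = smf.ols("y ~ x", data=df).fit()
frame = fit.get_prediction(pd.DataFrame({"x": [x0]})).summary_frame()
ci_direct = frame[["mean_ci_lower", "mean_ci_upper"]].iloc[0].values

print(np.allclose(ci_center, ci_direct))  # True
```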

from statsmodels.iolib.summary2 import summary_col

summary = summary_col(
    [model05, model052],
    stars=True,
    model_names=["Uncentered", "Centered"],
    info_dict={
        "R-squared": lambda x: f"{x.rsquared:.4f}",
        "No. observations": lambda x: f"{int(x.nobs)}",
    },
)

print(summary)

======================================
                 Uncentered  Centered 
--------------------------------------
Intercept        1.4927***  2.7001*** 
                 (0.0753)   (0.0199)  
sat              0.0015***  0.0015*** 
                 (0.0001)   (0.0001)  
hsperc           -0.0139*** -0.0139***
                 (0.0006)   (0.0006)  
hsize            -0.0609*** -0.0609***
                 (0.0165)   (0.0165)  
hsizesq          0.0055**   0.0055**  
                 (0.0023)   (0.0023)  
R-squared        0.2781     0.2781    
R-squared Adj.   0.2774     0.2774    
No. observations 4137       4137      
R-squared        0.2781     0.2781    
======================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01

Note: centering $hsize$ and then squaring gives a different variable than squaring $hsize$ and then centering.
The correct order is:

$square \rightarrow center$
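A small numeric check of the ordering (illustrative values):

```python
import numpy as np

hsize = np.array([2.0, 5.0, 8.0])

sq_then_center = hsize**2 - 25       # correct: square first, then subtract 5**2
center_then_sq = (hsize - 5) ** 2    # wrong order: a different variable entirely

print(sq_then_center)   # -21, 0, 39
print(center_then_sq)   # 9, 0, 9
```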

6.6 Confidence Interval for Future College GPA

To get a prediction interval for an individual student's GPA:

model06 = smf.ols("colgpa ~ sat + hsperc + hsize + hsizesq", data=gpa2).fit()
model06.summary2().tables[1].iloc[:, :2]
new_data = pd.DataFrame({"sat": [1200], "hsperc": [30], "hsize": [5], "hsizesq": [25]})
pred = model06.get_prediction(new_data)
pred.summary_frame()

The confidence interval for the prediction is [1.6, 3.8]. It is wider because it also accounts for individual-specific (residual) variation due to unobserved factors.

Note: Example 6.5 obtained a confidence interval for the average college GPA of students with the given characteristics.
Example 6.6 adds the residual variation to cover an individual student's GPA.
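The relationship between the two intervals can be seen directly in the summary_frame columns on simulated data (illustrative, not gpa2):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: compare the two intervals at one point
rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 1.0 + 0.5 * df["x"] + rng.normal(size=200)

fit = smf.ols("y ~ x", data=df).fit()
frame = fit.get_prediction(pd.DataFrame({"x": [0.8]})).summary_frame()

mean_width = frame["mean_ci_upper"][0] - frame["mean_ci_lower"][0]
obs_width = frame["obs_ci_upper"][0] - frame["obs_ci_lower"][0]
print(obs_width > mean_width)  # True: the individual interval adds residual variance
```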

6.7 Predicting CEO Salaries

wooldridge.data("ceosal2", description=True)
name of dataset: ceosal2
no of variables: 15
no of observations: 177

+----------+--------------------------------+
| variable | label                          |
+----------+--------------------------------+
| salary   | 1990 compensation, $1000s      |
| age      | in years                       |
| college  | =1 if attended college         |
| grad     | =1 if attended graduate school |
| comten   | years with company             |
| ceoten   | years as ceo with company      |
| sales    | 1990 firm sales, millions      |
| profits  | 1990 profits, millions         |
| mktval   | market value, end 1990, mills. |
| lsalary  | log(salary)                    |
| lsales   | log(sales)                     |
| lmktval  | log(mktval)                    |
| comtensq | comten^2                       |
| ceotensq | ceoten^2                       |
| profmarg | profits as % of sales          |
+----------+--------------------------------+

See CEOSAL1.RAW
model07 = smf.ols("lsalary ~ lsales + lmktval + ceoten", data=ceosal2).fit()
model07.summary2().tables[1].iloc[:, :2]
resid = model07.resid
n = model07.nobs

The Duan smearing estimate is

$\hat\sigma_0 = n^{-1} \sum_{i=1}^{n} \exp(\hat u_i)$
sigma0 = np.sum(np.exp(resid)) / n
sigma0
np.float64(1.1356613266630768)
duan_smearing_estimator(model07)
np.float64(1.1356613266630768)
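Why the correction matters can be sketched on simulated lognormal data (illustrative, not the CEO data): exponentiating a predicted log systematically understates $E[y]$, and multiplying by the smearing factor repairs the bias.

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.normal(scale=0.5, size=100_000)     # errors on the log scale
naive = np.exp(1.0)                         # exp of the predicted log(y) = 1.0
smeared = naive * np.mean(np.exp(u))        # Duan-corrected prediction

true_mean = np.exp(1.0 + 0.5**2 / 2)        # exact lognormal mean E[y]
print(abs(smeared - true_mean) < abs(naive - true_mean))  # True
```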

The prediction for $sales = 5000$, $mktval = 10000$, $ceoten = 10$:

new_data = pd.DataFrame(
    {"lsales": [np.log(5000)], "lmktval": [np.log(10000)], "ceoten": [10]}
)
prediction = model07.predict(new_data)
prediction
0 7.014 dtype: float64

The predicted value is

$\hat{salary} = \hat\sigma_0 \exp(\widehat{lsalary})$
np.exp(prediction) * sigma0
0 1263.059 dtype: float64

6.8 Predicting CEO Salaries

model08 = smf.ols("lsalary ~ lsales + lmktval + ceoten", data=ceosal2).fit()
model08.summary2().tables[1].iloc[:, :2]

To get $salary$ from our log model, take the exponential of the fitted values:

$\hat m = \exp(\widehat{lsalary})$

Call it $\hat m$ to avoid confusion with the $\hat y$ predicted from the level model.

mhat = np.exp(model08.predict())
df = ceosal2.copy()
df["mhat"] = mhat

Then compute the correlation between $\hat m$ and $salary$, and square it:

corr = df["mhat"].corr(df["salary"])
np.square(corr)
np.float64(0.2430807795867979)

The log model explains 24.3% of the variation in $salary$.

model082 = smf.ols("salary ~ sales + mktval + ceoten", data=ceosal2).fit()
model082.rsquared
np.float64(0.20127439139676495)

The level model explains 20% of the variation in $salary$, while the log model explains 24.3%. The log model is preferred to the level model.

🚧Computer Exercises