
Multiple Regression Analysis: Estimation

import pandas as pd
import numpy as np
import wooldridge
import statsmodels.formula.api as smf
def summarize_model(model):
    """
    Returns a summary table with parameters, number of observations, and R-squared.

    Parameters:
        model: A fitted statsmodels regression result (e.g., OLS, Logit).

    Returns:
        pd.DataFrame: Summary table.
    """
    params = model.params
    summary_df = pd.DataFrame(
        {
            "Parameter": list(params.index) + ["Number of Observations", "R-squared"],
            "Estimate": list(params.values) + [int(model.nobs), model.rsquared],
        }
    )
    return summary_df
wooldridge.data()
  J.M. Wooldridge (2019) Introductory Econometrics: A Modern Approach,
  Cengage Learning, 6th edition.

  401k       401ksubs    admnrev       affairs     airfare
  alcohol    apple       approval      athlet1     athlet2
  attend     audit       barium        beauty      benefits
  beveridge  big9salary  bwght         bwght2      campus
  card       catholic    cement        census2000  ceosal1
  ceosal2    charity     consump       corn        countymurders
  cps78_85   cps91       crime1        crime2      crime3
  crime4     discrim     driving       earns       econmath
  elem94_95  engin       expendshares  ezanders    ezunem
  fair       fertil1     fertil2       fertil3     fish
  fringe     gpa1        gpa2          gpa3        happiness
  hprice1    hprice2     hprice3       hseinv      htv
  infmrt     injury      intdef        intqrt      inven
  jtrain     jtrain2     jtrain3       kielmc      lawsch85
  loanapp    lowbrth     mathpnl       meap00_01   meap01
  meap93     meapsingle  minwage       mlb1        mroz
  murder     nbasal      nyse          okun        openness
  pension    phillips    pntsprd       prison      prminwge
  rdchem     rdtelec     recid         rental      return
  saving     sleep75     slp75_81      smoke       traffic1
  traffic2   twoyear     volat         vote1       vote2
  voucher    wage1       wage2         wagepan     wageprc
  wine

Examples

3.1 Determinants of College GPA

wooldridge.data("gpa1", description=True)
name of dataset: gpa1
no of variables: 29
no of observations: 141

+----------+--------------------------------+
| variable | label                          |
+----------+--------------------------------+
| age      | in years                       |
| soph     | =1 if sophomore                |
| junior   | =1 if junior                   |
| senior   | =1 if senior                   |
| senior5  | =1 if fifth year senior        |
| male     | =1 if male                     |
| campus   | =1 if live on campus           |
| business | =1 if business major           |
| engineer | =1 if engineering major        |
| colGPA   | MSU GPA                        |
| hsGPA    | high school GPA                |
| ACT      | 'achievement' score            |
| job19    | =1 if job <= 19 hours          |
| job20    | =1 if job >= 20 hours          |
| drive    | =1 if drive to campus          |
| bike     | =1 if bicycle to campus        |
| walk     | =1 if walk to campus           |
| voluntr  | =1 if do volunteer work        |
| PC       | =1 of pers computer at sch     |
| greek    | =1 if fraternity or sorority   |
| car      | =1 if own car                  |
| siblings | =1 if have siblings            |
| bgfriend | =1 if boy- or girlfriend       |
| clubs    | =1 if belong to MSU club       |
| skipped  | avg lectures missed per week   |
| alcohol  | avg # days per week drink alc. |
| gradMI   | =1 if Michigan high school     |
| fathcoll | =1 if father college grad      |
| mothcoll | =1 if mother college grad      |
+----------+--------------------------------+

Christopher Lemmon, a former MSU undergraduate, collected these data
from a survey he took of MSU students in Fall 1994.
df = wooldridge.data("gpa1")
df.head()
model01 = smf.ols("colGPA ~ hsGPA + ACT", data=df).fit()
model01.params, model01.nobs
(Intercept    1.286328
 hsGPA        0.453456
 ACT          0.009426
 dtype: float64,
 141.0)

1.29 is the predicted college GPA if both hsGPA and ACT are zero, which is not meaningful because no one has a zero high school GPA or a zero ACT score.

Holding ACT fixed, one more point of hsGPA is associated with a 0.45-point increase in colGPA.

Holding hsGPA fixed, a 10-point increase in ACT is associated with only a 0.094-point increase in colGPA, a very small effect.
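To see the ACT effect on the prediction scale, here is a small sketch comparing two hypothetical students who differ only in ACT (the GPA and score values are made up for illustration):

# Two hypothetical students, identical except for a 10-point ACT difference.
students = pd.DataFrame({"hsGPA": [3.0, 3.0], "ACT": [20, 30]})
preds = model01.predict(students)
preds[1] - preds[0]  # ≈ 0.094, i.e., 10 times the ACT coefficient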

3.2 Hourly Wage Equation

wooldridge.data("wage1", description=True)
name of dataset: wage1
no of variables: 24
no of observations: 526

+----------+---------------------------------+
| variable | label                           |
+----------+---------------------------------+
| wage     | average hourly earnings         |
| educ     | years of education              |
| exper    | years potential experience      |
| tenure   | years with current employer     |
| nonwhite | =1 if nonwhite                  |
| female   | =1 if female                    |
| married  | =1 if married                   |
| numdep   | number of dependents            |
| smsa     | =1 if live in SMSA              |
| northcen | =1 if live in north central U.S |
| south    | =1 if live in southern region   |
| west     | =1 if live in western region    |
| construc | =1 if work in construc. indus.  |
| ndurman  | =1 if in nondur. manuf. indus.  |
| trcommpu | =1 if in trans, commun, pub ut  |
| trade    | =1 if in wholesale or retail    |
| services | =1 if in services indus.        |
| profserv | =1 if in prof. serv. indus.     |
| profocc  | =1 if in profess. occupation    |
| clerocc  | =1 if in clerical occupation    |
| servocc  | =1 if in service occupation     |
| lwage    | log(wage)                       |
| expersq  | exper^2                         |
| tenursq  | tenure^2                        |
+----------+---------------------------------+

These are data from the 1976 Current Population Survey, collected by
Henry Farber when he and I were colleagues at MIT in 1988.
df = wooldridge.data("wage1")
df.head()
model02 = smf.ols("np.log(wage) ~ educ + exper + tenure", data=df).fit()
model02.params, model02.nobs
(Intercept    0.284360
 educ         0.092029
 exper        0.004121
 tenure       0.022067
 dtype: float64,
 526.0)

Holding exper and tenure fixed, one more year of education is predicted to increase wage by about 9.2%.
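The 9.2% figure relies on the log approximation; in a log-level model the exact percentage effect is slightly larger. A quick check:

# Exact percentage change implied by a log-level coefficient: 100 * (exp(b) - 1).
100 * (np.exp(model02.params["educ"]) - 1)  # ≈ 9.6%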

3.3 Participation in 401(k) Pension Plans

wooldridge.data("401k", description=True)
name of dataset: 401k
no of variables: 8
no of observations: 1534

+----------+---------------------------------+
| variable | label                           |
+----------+---------------------------------+
| prate    | participation rate, percent     |
| mrate    | 401k plan match rate            |
| totpart  | total 401k participants         |
| totelg   | total eligible for 401k plan    |
| age      | age of 401k plan                |
| totemp   | total number of firm employees  |
| sole     | = 1 if 401k is firm's sole plan |
| ltotemp  | log of totemp                   |
+----------+---------------------------------+

L.E. Papke (1995), “Participation in and Contributions to 401(k)
Pension Plans:Evidence from Plan Data,” Journal of Human Resources 30,
311-325. Professor Papke kindly provided these data. She gathered them
from the Internal Revenue Service’s Form 5500 tapes.
df = wooldridge.data("401k")
df.head()
model03 = smf.ols("prate ~ mrate + age", data=df).fit()
model03.params, model03.nobs
(Intercept    80.119047
 mrate         5.521289
 age           0.243147
 dtype: float64,
 1534.0)
model032 = smf.ols("prate ~ mrate", data=df).fit()
model032.params, model032.nobs
(Intercept    83.075455
 mrate         5.861079
 dtype: float64,
 1534.0)
df["age"].corr(df["mrate"])
np.float64(0.11878414864794844)

The simple regression estimate of the mrate effect (5.86) differs from the multiple regression estimate (5.52), but the difference is not large because the correlation between mrate and age is only 0.12.
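This reflects the standard algebraic link between simple and multiple regression slopes: the simple slope equals the multiple slope plus the age coefficient times the slope from an auxiliary regression of age on mrate. A sketch verifying it here:

# Auxiliary regression of the omitted variable (age) on the included one (mrate).
delta = smf.ols("age ~ mrate", data=df).fit().params["mrate"]
# Recover the simple-regression slope from the multiple-regression estimates.
model03.params["mrate"] + model03.params["age"] * delta  # equals model032's 5.8611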

3.4 Determinants of College GPA

df = wooldridge.data("gpa1")
model04 = smf.ols("colGPA ~ hsGPA + ACT", data=df).fit()
summarize_model(model04)

$R^2 = 17.6\%$ means that hsGPA and ACT together explain about 17.6% of the variation in colGPA in the sample.
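As a sanity check, $R^2$ can be computed directly from its definition, $1 - SSR/SST$; a minimal sketch:

# R-squared from its definition: 1 - (sum of squared residuals) / (total SS).
ssr = (model04.resid ** 2).sum()
sst = ((df["colGPA"] - df["colGPA"].mean()) ** 2).sum()
1 - ssr / sst  # matches model04.rsquared ≈ 0.176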

3.5 Explaining Arrest Records

wooldridge.data("crime1", description=True)
name of dataset: crime1
no of variables: 16
no of observations: 2725

+----------+---------------------------------+
| variable | label                           |
+----------+---------------------------------+
| narr86   | # times arrested, 1986          |
| nfarr86  | # felony arrests, 1986          |
| nparr86  | # property crme arr., 1986      |
| pcnv     | proportion of prior convictions |
| avgsen   | avg sentence length, mos.       |
| tottime  | time in prison since 18 (mos.)  |
| ptime86  | mos. in prison during 1986      |
| qemp86   | # quarters employed, 1986       |
| inc86    | legal income, 1986, $100s       |
| durat    | recent unemp duration           |
| black    | =1 if black                     |
| hispan   | =1 if Hispanic                  |
| born60   | =1 if born in 1960              |
| pcnvsq   | pcnv^2                          |
| pt86sq   | ptime86^2                       |
| inc86sq  | inc86^2                         |
+----------+---------------------------------+

J. Grogger (1991), “Certainty vs. Severity of Punishment,” Economic
Inquiry 29, 297-309. Professor Grogger kindly provided a subset of the
data he used in his article.
df = wooldridge.data("crime1")
df.head()
model05 = smf.ols("narr86 ~ pcnv + ptime86 + qemp86", data=df).fit()
summarize_model(model05)
model052 = smf.ols("narr86 ~ pcnv + ptime86 + qemp86 + avgsen", data=df).fit()
summarize_model(model052)

Adding avgsen increases $R^2$ from 0.041 to 0.042, a negligible improvement. Moreover, its sign is surprising: a longer average sentence is predicted to increase criminal activity.

3.6 Hourly Wage Equation

df = wooldridge.data("wage1")
df.head()
model06 = smf.ols("np.log(wage) ~ educ", data=df).fit()
summarize_model(model06)

Ability is an omitted variable that is positively correlated with education. Hence, in the simple regression $\log(wage) = \beta_0 + \beta_1\,educ + v$, the estimated $\beta_1$ is on average too large.

We can't say that 0.083 is greater than $\beta_1$ in this particular sample; the estimate can be higher or lower than the true value. But on average across random samples, the estimates will be too large.
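The direction of the bias follows from the omitted variable formula, with ability as the omitted regressor:

$E(\tilde{\beta}_1) = \beta_1 + \beta_2\,\delta_1$

where $\beta_2 > 0$ (ability raises wages) and $\delta_1$ is the slope from a regression of ability on educ, which is positive given the positive correlation. Both terms are positive, so the bias is positive on average.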

3.7 Evaluating a Job Training Program

Computer Problems

C1 A problem of interest to health officials (and others) is to determine the effects of smoking during pregnancy on infant health. One measure of infant health is birth weight; a birth weight that is too low can put an infant at risk for contracting various illnesses. Since factors other than cigarette smoking that affect birth weight are likely to be correlated with smoking, we should take those factors into account.

For example, higher income generally results in access to better prenatal care, as well as better nutrition for the mother. An equation that recognizes this is

$bwght = \beta_0 + \beta_1\,cigs + \beta_2\,faminc + u$

(i) What is the most likely sign for $\beta_2$?

$\beta_2 > 0$, because higher income is likely associated with better nutrition and prenatal care.

(ii) Do you think cigs and faminc are likely to be correlated? Explain why the correlation might be positive or negative.

Most likely negatively correlated: as income rises, health awareness and education also tend to rise, resulting in fewer cigarettes smoked.

(iii) Now, estimate the equation with and without faminc, using the data in BWGHT. Report the results in equation form, including the sample size and R-squared. Discuss your results, focusing on whether adding faminc substantially changes the estimated effect of cigs on bwght.

wooldridge.data("bwght", description=True)
name of dataset: bwght
no of variables: 14
no of observations: 1388

+----------+--------------------------------+
| variable | label                          |
+----------+--------------------------------+
| faminc   | 1988 family income, $1000s     |
| cigtax   | cig. tax in home state, 1988   |
| cigprice | cig. price in home state, 1988 |
| bwght    | birth weight, ounces           |
| fatheduc | father's yrs of educ           |
| motheduc | mother's yrs of educ           |
| parity   | birth order of child           |
| male     | =1 if male child               |
| white    | =1 if white                    |
| cigs     | cigs smked per day while preg  |
| lbwght   | log of bwght                   |
| bwghtlbs | birth weight, pounds           |
| packs    | packs smked per day while preg |
| lfaminc  | log(faminc)                    |
+----------+--------------------------------+

J. Mullahy (1997), “Instrumental-Variable Estimation of Count Data
Models: Applications to Models of Cigarette Smoking Behavior,” Review
of Economics and Statistics 79, 596-593. Professor Mullahy kindly
provided the data. He obtained them from the 1988 National Health
Interview Survey.
df = wooldridge.data("bwght")
df.head()
model1 = smf.ols(formula="bwght ~ cigs", data=df).fit()
model1.params, model1.nobs, model1.rsquared
(Intercept    119.771900
 cigs          -0.513772
 dtype: float64,
 1388.0,
 np.float64(0.022729121106052963))
model2 = smf.ols(formula="bwght ~ faminc + cigs", data=df).fit()
model2.params, model2.nobs, model2.rsquared
(Intercept    116.974130
 faminc         0.092765
 cigs          -0.463408
 dtype: float64,
 1388.0,
 np.float64(0.029804837327581657))
df["faminc"].corr(df["cigs"])
np.float64(-0.17304492573586483)

The correlation between faminc and cigs is weakly negative. When faminc is omitted, its positive effect on birth weight is absorbed into the cigs coefficient. Since faminc and cigs are negatively correlated, this produces a downward (negative) bias, making the effect of cigs appear slightly more harmful than it actually is.
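The same simple-vs-multiple regression algebra from Example 3.3 applies here; a sketch confirming that omitting faminc reproduces model1's slope on cigs exactly:

# Slope from the auxiliary regression of the omitted variable (faminc) on cigs.
delta = smf.ols("faminc ~ cigs", data=df).fit().params["cigs"]
# Multiple-regression slope plus (faminc coefficient * delta) gives model1's slope.
model2.params["cigs"] + model2.params["faminc"] * delta  # ≈ -0.5138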

C2

Use the data in HPRICE1 to estimate the model

$\text{price} = \beta_0 + \beta_1\,\text{sqrft} + \beta_2\,\text{bdrms} + u$
wooldridge.data("hprice1", description=True)
name of dataset: hprice1
no of variables: 10
no of observations: 88

+----------+------------------------------+
| variable | label                        |
+----------+------------------------------+
| price    | house price, $1000s          |
| assess   | assessed value, $1000s       |
| bdrms    | number of bdrms              |
| lotsize  | size of lot in square feet   |
| sqrft    | size of house in square feet |
| colonial | =1 if home is colonial style |
| lprice   | log(price)                   |
| lassess  | log(assess)                  |
| llotsize | log(lotsize)                 |
| lsqrft   | log(sqrft)                   |
+----------+------------------------------+

Collected from the real estate pages of the Boston Globe during 1990.
These are homes that sold in the Boston, MA area.
df = wooldridge.data("hprice1")
df.head()

(i) Write out the results in equation form.

model = smf.ols(formula="price ~ sqrft + bdrms", data=df).fit()
model.params, model.nobs, model.rsquared
(Intercept   -19.314996
 sqrft          0.128436
 bdrms         15.198191
 dtype: float64,
 88.0,
 np.float64(0.6319184011671723))
$\widehat{\text{price}} = -19.32 + 0.128\,\text{sqrft} + 15.2\,\text{bdrms}, \qquad n = 88, \quad R^2 = 0.63$

(ii) What is the estimated increase in price for a house with one more bedroom, holding square footage constant?

An additional bedroom is predicted to increase price by about $15,200, holding square footage fixed (price is measured in $1000s).

(iii) What is the estimated increase in price for a house with an additional bedroom that is 140 square feet in size? Compare this to your answer in part (ii).

model.params["sqrft"] * 140 + model.params["bdrms"] * 1
np.float64(33.17926041938293)

An additional bedroom that adds 140 square feet raises the prediction through both sqrft and bdrms, for a total increase of about $33,179, more than double the answer in part (ii).

(iv) What percentage of the variation in price is explained by square footage and number of bedrooms?

About 63.2%

(v) The first house in the sample has sqrft = 2,438 and bdrms= 4. Find the predicted selling price for this house from the OLS regression line.

model.predict(df.iloc[0])
0    354.605249
dtype: float64
print(
    model.params["Intercept"] + model.params["sqrft"] * 2438 + model.params["bdrms"] * 4
)
354.6052489839994

The predicted price is $354,605

(vi) The actual selling price of the first house in the sample was $300,000 (so price = 300). Find the residual for this house. Does it suggest that the buyer underpaid or overpaid for the house?

df.loc[0]["price"] - model.predict(df.iloc[0])
0   -54.605249
dtype: float64

The residual is -54.605, meaning the actual price is $54,605 below the predicted price; this suggests the buyer underpaid for the house relative to the regression prediction.

C3

The file CEOSAL2 contains data on 177 chief executive officers and can be used to examine the effects of firm performance on CEO salary.

wooldridge.data("ceosal2", description=True)
name of dataset: ceosal2
no of variables: 15
no of observations: 177

+----------+--------------------------------+
| variable | label                          |
+----------+--------------------------------+
| salary   | 1990 compensation, $1000s      |
| age      | in years                       |
| college  | =1 if attended college         |
| grad     | =1 if attended graduate school |
| comten   | years with company             |
| ceoten   | years as ceo with company      |
| sales    | 1990 firm sales, millions      |
| profits  | 1990 profits, millions         |
| mktval   | market value, end 1990, mills. |
| lsalary  | log(salary)                    |
| lsales   | log(sales)                     |
| lmktval  | log(mktval)                    |
| comtensq | comten^2                       |
| ceotensq | ceoten^2                       |
| profmarg | profits as % of sales          |
+----------+--------------------------------+

See CEOSAL1.RAW
df = wooldridge.data("ceosal2")
df.head()

(i) Estimate a model relating annual salary to firm sales and market value. Make the model of the constant elasticity variety for both independent variables. Write the results out in equation form.

model = smf.ols(
    formula="np.log(salary) ~ np.log(sales) + np.log(mktval)", data=df
).fit()
model.params, model.nobs, model.rsquared
(Intercept        4.620917
 np.log(sales)    0.162128
 np.log(mktval)   0.106708
 dtype: float64,
 177.0,
 np.float64(0.29911361599869846))
$\log(\text{salary}) = 4.621 + 0.162\,\log(\text{sales}) + 0.107\,\log(\text{mktval}), \qquad n = 177, \quad R^2 = 0.30$

Both slopes are elasticities: a 1% increase in sales is associated with roughly a 0.16% increase in salary, holding market value fixed.

(ii) Add profits to the model from part (i). Why can this variable not be included in logarithmic form? Would you say that these firm performance variables explain most of the variation in CEO salaries?

df["profits"].describe()
count     177.000000
mean      207.830508
std       404.454296
min      -463.000000
25%        34.000000
50%        63.000000
75%       208.000000
max      2700.000000
Name: profits, dtype: float64
model2 = smf.ols(
    formula=("np.log(salary) ~ np.log(sales) + " "np.log(mktval) + np.log(profits)"),
    data=df,
).fit()

model2.params, model2.nobs, model2.rsquared
e:\miniconda3\Lib\site-packages\pandas\core\arraylike.py:399: RuntimeWarning: invalid value encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
e:\miniconda3\Lib\site-packages\pandas\core\arraylike.py:399: RuntimeWarning: invalid value encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
(Intercept          4.211870
 np.log(sales)      0.211419
 np.log(mktval)     0.175385
 np.log(profits)   -0.102808
 dtype: float64,
 168.0,
 np.float64(0.3388726790440221))

Can't use $\log(\text{profits})$ because profits is negative for some firms, where the log is undefined; statsmodels drops those observations, so n falls from 177 to 168.

model3 = smf.ols(
    formula=("np.log(salary) ~ np.log(sales) + " "np.log(mktval) + profits"), data=df
).fit()

model3.params, model3.nobs, model3.rsquared
(Intercept        4.686924
 np.log(sales)    0.161368
 np.log(mktval)   0.097529
 profits          0.000036
 dtype: float64,
 177.0,
 np.float64(0.2993366493504458))
model3.params["profits"] * 1000
np.float64(0.0356600694005661)

If profits increase by $1 billion (1,000 in the data, since profits is measured in millions), salary is predicted to increase by approximately 3.6%, ceteris paribus.

The model explains around 30% of the sample variation in log(salary), so these firm performance variables do not explain most of the variation in CEO salaries.

(iii) Add the variable ceoten to the model in part (ii). What is the estimated percentage return for another year of CEO tenure, holding other factors fixed?

model4 = smf.ols(
    formula=("np.log(salary) ~ np.log(sales) + " "np.log(mktval) + profits + ceoten"),
    data=df,
).fit()

model4.params, model4.nobs, model4.rsquared
(Intercept        4.557780
 np.log(sales)    0.162234
 np.log(mktval)   0.101760
 profits          0.000029
 ceoten           0.011685
 dtype: float64,
 177.0,
 np.float64(0.3182987842099966))
model4.params["ceoten"] * 100
np.float64(1.1684668422563895)

One additional year of CEO tenure is predicted to increase salary by approximately 1.17%, holding other factors fixed.

(iv) Find the sample correlation coefficient between the variables log(mktval) and profits. Are these variables highly correlated? What does this say about the OLS estimators?

df["profits"].corr(np.log(df["mktval"]))
np.float64(0.7768975970597894)

The correlation is roughly 0.78, which indicates substantial multicollinearity. Multicollinearity does not bias the OLS estimators, but it inflates the variance of the estimates.
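One way to quantify this is the variance inflation factor; a sketch using statsmodels' variance_inflation_factor on the regressors from part (iii):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Design matrix of model4; column 0 is the intercept, so compute VIFs for the slopes only.
X = model4.model.exog
[variance_inflation_factor(X, i) for i in range(1, X.shape[1])]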

C4

Use the data in ATTEND for this exercise.

wooldridge.data("attend", description=True)
name of dataset: attend
no of variables: 11
no of observations: 680

+----------+------------------------------+
| variable | label                        |
+----------+------------------------------+
| attend   | classes attended out of 32   |
| termGPA  | GPA for term                 |
| priGPA   | cumulative GPA prior to term |
| ACT      | ACT score                    |
| final    | final exam score             |
| atndrte  | percent classes attended     |
| hwrte    | percent homework turned in   |
| frosh    | =1 if freshman               |
| soph     | =1 if sophomore              |
| missed   | number of classes missed     |
| stndfnl  | (final - mean)/sd            |
+----------+------------------------------+

These data were collected by Professors Ronald Fisher and Carl
Liedholm during a term in which they both taught principles of
microeconomics at Michigan State University. Professors Fisher and
Liedholm kindly gave me permission to use a random subset of their
data, and their research assistant at the time, Jeffrey Guilfoyle, who
completed his Ph.D. in economics at MSU, provided helpful hints.
df = wooldridge.data("attend")
df.head()

(i) Obtain the minimum, maximum, and average values for the variables atndrte, priGPA, and ACT

df[["atndrte", "priGPA", "ACT"]].agg(["min", "max", "mean"])

(ii) Estimate the model

$atndrte = \beta_0 + \beta_1\,priGPA + \beta_2\,ACT + u$

and write the results in equation form. Interpret the intercept. Does it have a useful meaning?

model = smf.ols("atndrte ~ priGPA + ACT", data=df).fit()
model.params, model.nobs, model.rsquared
(Intercept    75.700405
 priGPA       17.260591
 ACT          -1.716553
 dtype: float64,
 680.0,
 np.float64(0.29058148401574935))
$\widehat{atndrte} = 75.7 + 17.3\,priGPA - 1.7\,ACT, \qquad n = 680, \quad R^2 = 0.29$

The intercept indicates that a student with a prior GPA of zero and an ACT score of zero is predicted to have an attendance rate of 75.7%. However, this estimate has no useful meaning: it extrapolates beyond the range of the observed data, since no student in the sample has both a zero GPA and a zero ACT score.

(iii) Discuss the estimated slope coefficients. Are there any surprises?

df["priGPA"].corr(df["ACT"])
np.float64(0.3537693606849367)

A one-point increase in priGPA increases the attendance rate by 17.3 percentage points, holding ACT fixed. A one-point increase in ACT decreases the attendance rate by 1.7 percentage points, holding priGPA fixed. The negative ACT effect is surprising; it may reflect that students with high ACT scores believe they can pass without attending lectures.

(iv) What is the predicted atndrte if priGPA = 3.65 and ACT = 20? What do you make of this result? Are there any students in the sample with these values of the explanatory variables?

new_data = pd.DataFrame({"priGPA": [3.65], "ACT": [20]})
model.predict(new_data)
0    104.370504
dtype: float64
tolerance = 0.1

df.loc[
    (df["priGPA"].sub(3.65).abs() <= tolerance) & (df["ACT"].sub(20).abs() <= tolerance)
]

A student with a prior GPA of 3.65 and an ACT score of 20 is predicted to have a 104% attendance rate. Since the attendance rate is bounded above by 100%, we would predict 100% instead. The one student in the sample with approximately these values has an attendance rate of 87.5%.
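A minimal way to enforce this bound in code (a post-processing convention, not part of the OLS model itself):

# Clip out-of-range predictions to the [0, 100] interval.
np.clip(model.predict(new_data), 0, 100)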

(v) If Student A has priGPA = 3.1 and ACT = 21 and Student B has priGPA = 2.1 and ACT = 26, what is the predicted difference in their attendance rates?

new_data = pd.DataFrame({"priGPA": [3.1, 2.1], "ACT": [21, 26]})
y_hat = model.predict(new_data)
y_hat[0] - y_hat[1]
np.float64(25.843355740233946)

The difference between the two predictions is around 25.84 percentage points