Regression Analysis

88. Hypothesis testing

We take a sample from the population, then calculate \(\bar x\) to estimate the population mean.

We can calculate \(\bar x = 185\) and may have a hypothesis that

\[ H_0: \mu = 180\\ H_1: \mu > 180 \]

\(H_0\) is the default case that we take for granted unless we find evidence against it.

To test the hypothesis, we rely on the central limit theorem, which states that the distribution of \(\bar x\) (i.e., over repeated samples) will be approximately normal and centered around \(\mu\).

To test, we ask: how likely is a sample mean this large, given that the null is true? Since the area under the normal curve equals 1, we can read the right tail as the probability of getting such a big height given that the true height is 180.

If the probability of getting a number this big, given that \(H_0\) is true, is \(\le 0.05\), we reject the null hypothesis.

Why?

Because it is then more plausible that the value in the tail actually comes from another distribution centered around a number near that big value.

If the cutoff value is 175 (at the tail) and \(\bar x = 170\), we fail to reject the null hypothesis, because \(P(reject \mid True, \bar x = 170) > 0.05\).

Note: we never say accept \(H_0\)

89. Hypothesis testing - one and two tailed tests

The previous example was one-tailed, because \(H_1\) only asked whether the parameter is greater than the \(H_0\) value.

We can do a one-tailed test the other way, i.e., look at the left side, where \(c\) is the cutoff value:

\[ P(reject ~null|True, \bar x < -c)\le 0.05 \]

Remember the formula for the right side too

\[ P(reject ~null|True, \bar x >c)\le 0.05 \]

Finally, we have the two-tailed test, where \(H_1: \mu \neq 180\). To decide, we use cutoffs \(c'\) on both sides, with probability \(0.025\) in each tail.

90. Central limit theorem

We mentioned that hypothesis testing depends on the central limit theorem, but what does it say?

Imagine one population. We take \(N\) samples, and in each sample we calculate \(\bar x\). The CLT says that the frequency distribution of \(\bar x\) will be approximately normal and centered around the mean, no matter what the population's distribution is.

Why is it true?

Imagine a uniform distribution (a square density ranging from 0 to 1); the expected value of \(x\) is 0.5:

\[ E[x] = 0.5 \]

What if we calculate \(\bar x\) by taking the average of two draws \(x_1, x_2\)?

\[ \bar x = \dfrac{x_1+x_2}{2} \]

the two values range from 0 to 1

The probability of getting \(\bar x = 0.1\) is very small, because both \(x\)'s need to be small too. By symmetry, getting a very big \(\bar x\) is just as hard.

Values in the middle, by contrast, are easy to get in many ways; hence the normal shape forms.

Tip:

try it with discrete numbers \(0.1, 0.2, \dots 0.9, 1\)
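The averaging argument above can be checked with a quick simulation (an illustrative sketch; the sample counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Average n uniform draws per sample, many times, and watch the
# sample-mean distribution tighten and center around E[x] = 0.5.
def sample_means(n_per_sample, n_samples=100_000):
    draws = rng.uniform(0, 1, size=(n_samples, n_per_sample))
    return draws.mean(axis=1)

means_2 = sample_means(2)    # triangular shape already
means_30 = sample_means(30)  # visibly bell-shaped

print(round(float(means_30.mean()), 2))       # centered near 0.5
print(bool(means_2.std() > means_30.std()))   # more draws -> tighter spread
```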

91. Hypothesis testing in linear regression part 1

Back to our regression: the population has a true \(\beta\), but we don't know it, so we estimate it using inferential statistics.

Example:

\[ income = 30000+0.1Pinc+\dots \]

0.1 is quite small; maybe it is due to sampling error, or maybe it is a true effect in the population.

So we check the sampling distribution of \(\hat \beta_{LS}\), which is normal with mean \(\beta\) and a variance that is a function of the error variance, \(f(\sigma^2)\). We will not prove this, but it holds as the sample size \(N \to \infty\):

\[ \hat \beta \sim N(\beta_{True}, f(\sigma^2)) \]

Of course, we don't know \(f(\sigma^2)\), because we don't know \(\sigma^2\) of the error term \(u \sim N(0, \sigma^2)\), but we can estimate it using the residuals.

Since the sampling distribution's width depends on a variance we don't know, we form another distribution using the variance estimate:

\[ t = \dfrac{B^*_{LS}}{SE(\hat B_{LS})} \]

meaning the estimated \(\beta\) divided by its standard error <we already calculated the standard error of \(\hat \beta\)>. This \(t\) distribution is more spread out (fatter tails) but approaches the normal as \(N \to \infty\):

\[ N \to \infty \quad t \to N \]

92. Hypothesis testing in linear regression part 2

Continuing with the last example,

\[ income = 30000+0.1Pinc+\dots \]

is the \(0.1 \, Pinc\) really a true effect, or due to sampling error? We said, without proof, that \(\hat \beta_{LS}\) is normal with variance \(f(\sigma^2)\).

So we formed another distribution based on the estimate of \(\sigma^2\):

\[ t = \dfrac{B^*_{LS}}{SE(\hat B_{LS})} \]

Usually \(H_0: \beta = 0\); if the null hypothesis is another value, we adjust the \(t\) statistic by subtracting the null value:

\[ \boxed{t = \dfrac{B^*_{LS} - \beta_{H0}}{SE(\hat B_{LS})}} \]

The null distribution is centered around the \(H_0\) value (usually zero) and follows a \(t\) distribution.

If we test against \(H_1: \beta >0\)

If \(\beta^*_{LS} = 0.1\) and \(se = 0.01\), then \(t = 10\).

\(t = 10\) is far from \(0\) (\(H_0\)); maybe it comes from the null distribution, but with low probability, or maybe it comes from another distribution (the alternative hypothesis \(H_1\)) where \(\beta > 0\).

How to decide? Calculate the probability, or check the \(t\) table at the \(0.05\) critical value. The critical value is around 2 while we got a \(t\) statistic of 10, so we reject the null hypothesis and accept the alternative hypothesis that \(\beta > 0\).
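A sketch of this decision in code; the degrees of freedom (100) are assumed here, since the note does not state \(N\):

```python
from scipy import stats

# One-tailed test from the example: beta_hat = 0.1, SE = 0.01,
# H0: beta = 0 vs H1: beta > 0. df = 100 is an assumed sample size.
beta_hat, se, df = 0.1, 0.01, 100

t_stat = beta_hat / se               # = 10
t_crit = stats.t.ppf(0.95, df)       # one-tailed 5% critical value, ~1.66
print(t_stat > t_crit)               # True: reject H0
```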

93. Hypothesis testing in linear regression part 3

Here is another example for practice: we are testing sales over time against TV advertising.

\[ Sales_t = \dots + 10TV_{(3)} \]

The \(3\) is the standard error. We need to test whether the coefficient is truly nonzero, or is actually 0 and only came out as 10 due to sampling error (if it is zero, we will stop spending money on TV advertising).

\[ H_0: \beta = 0\\ H_1: \beta >0 \]

We don't care whether the effect is negative; we only want to know whether it is positive, so we use a one-tailed test.

\(t = \frac{10}{3} = 3\frac13\)

We check at significance level \(0.05\) (a suitably small probability) with \(N-K\) degrees of freedom (coming from the SE), where \(K\) is the number of parameters.

The critical value is about 2 (this is usually the case, so take it as a rule of thumb).

Since \(3 \frac 1 3 > 2\), we reject the null hypothesis.

94. Hypothesis testing in linear regression part 4

Another example, testing how class size affects test scores

\[ TS = c +\dots +\beta cs \]

\[ H_0: \beta = 0\\ H_1: \beta <0 \]

we run the regression and get this table from a software

| Variable | Coefficient | SE | t | p |
| --- | --- | --- | --- | --- |
| c | 50 | 10 | 5 | 0.000 |
| cs | -10 | 20 | -0.5 | 0.863 |

The first variable is the constant (\(\alpha\)); it is usually added by default in most software, but make sure it is included anyway.

Coefficient values are calculated by OLS unless specified otherwise, followed by the standard error, \(t\) statistic, and \(p\) value.

Again, is -10 actually zero? We test and get \(t = -0.5\). We can look up the \(t\) table, but be careful:

only positive values are tabulated; since the \(t\) distribution is symmetric, we check the absolute value \(|t|\).

\[ |-0.5|<1.7 \to \text{fail to reject H0} \]

Instead of looking up the \(t\) table, we can check the \(p\) value: the probability of getting a value like -10 given that \(H_0\) is true. This probability is 0.863, which is large (we reject when it is < 0.05), so we fail to reject \(H_0\).

Note: statistical packages assume by default that we are testing \(\beta = 0\)

The constant, by contrast, is not zero, judging by its \(p\) value.

Note: statistical software calculates t value by assuming \(H_0 = 0\). You can modify the default setting

Remember the formula just in case

\[ t = \dfrac{B^*_{LS} - \beta_{H0}}{SE(\hat B_{LS})} \]

95. Hypothesis testing in linear regression part 5

One more example with dummy variables

\[ weight = c + \beta_1~ obese+ \beta_2 ~smoke \]

| Variable | Coefficient | SE | t | p |
| --- | --- | --- | --- | --- |
| c | 60 | 10 | 6 | 0.000 |
| obese | -1.5 | 0.5 | -3 | 0.001 |
| smoke | 1 | 1 | 1 | 0.435 |

From a previous study, we expect \(\beta_1 = -1\), so we test that:

\[ H_0: \beta_1 = -1\\ H_1: \beta_1< -1 \]

So we can't rely on the table, because it assumes \(H_0: \beta_1 = 0\); let's do it manually:

\[ t = \dfrac{-1.5 - (-1)}{0.5} = -1 \]

Looking at the \(t\) table with \(df = N-K\), where \(K\) is the number of parameters, we get a critical value of about 1.7.

\[ |-1|< 1.7\to \text{fail to reject H0} \]
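The same manual calculation in code; \(df = 30\) is an assumed sample size consistent with the critical value of about 1.7:

```python
from scipy import stats

# The manual test above: coefficient -1.5, SE 0.5, H0: beta1 = -1 (not 0).
# df = 30 is assumed for illustration.
beta_hat, se, beta_null, df = -1.5, 0.5, -1.0, 30

t_stat = (beta_hat - beta_null) / se     # = -1.0
t_crit = stats.t.ppf(0.95, df)           # ~1.70
print(abs(t_stat) < t_crit)              # True: fail to reject H0
```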

96. Normally distributed errors - finite sample inference

Here are two examples,

  1. relation of \(\log \,TS\) with \(parental~ \, income\)
  2. relation of \(wage\) with \(years ~of~education\)

We expect the first example to show a positive relationship (because richer parents send children to better schools). Although the test score is bounded, taking the \(\log\) stretches the lower end further down.

We expect the points to be normally distributed around the line. In practice, they cluster near the line, so they can be considered normal (if the Gauss-Markov assumptions are met).

On the second example, we expect the relation to be constant before 5 years, then positive relationship after 5 years of education.

Before 5 years, people either get the minimum wage or earn nothing at all.

Then the error terms become normally distributed after 5 years.

If we assume the minimum wage is \(7\), the errors before 5 years can be drawn as a line at \(-7\) representing unemployed people \(<0-7>\), with the remaining part at zero \(<7-7>\).

The problem with this?

the error terms are not normally distributed, so \(\hat \beta\) is not exactly normally distributed, and we must lean on the central limit theorem for \(\hat \beta\) to become asymptotically normal.

With a small sample size (\(N<30\)), we can't rely on the central limit theorem, so we can't (easily) do inference.

In the first example, if the error terms are normally distributed, \(\hat \beta\) is exactly normally distributed, and the \(t\) statistic follows an exact \(t\) distribution (because we don't know the population variance).

An alternative is using central limit theorem to get approximate normal distribution in case the error terms are not normal

On the second example, error terms are not normally distributed, so \(\hat \beta\) is not normally distributed

Solution:

If the sample size is \(N>100\), we can use the CLT to get an asymptotically normal \(\hat \beta\).

If not, use non-normal inference, which is outside our scope.

A sample size of \(N>30\) alone will not cure non-normality this severe.

Note:

The error term is a sum of many influences, \(u = u_1 + u_2 + \dots + u_p\). These may not be exactly normal, but the sum is asymptotically normal by the central limit theorem; this argument fails if the influences combine multiplicatively instead of additively.

Extra:

  1. If error terms are normal:
    • \(\hatβ\) is exactly normally distributed (finite-sample normality), regardless of N.
  2. If error terms are not normal:
    • For \(N<30\):
      • \(\hat \beta\) is not asymptotically normal, even with mild non-normality.
      • Use non-normal inference (e.g., bootstrapping, permutation tests).
    • For 30≤N<100:
      • If non-normality is mild (e.g., slight skew), \(\hat \beta\) is approximately normal due to the CLT.
      • If non-normality is severe (e.g., bimodal, extreme outliers), \(\hat \beta\) may still be non-normal; use diagnostics (Q-Q plots) or robust methods.
    • For N≥100:
      • \(\hat \beta\) is asymptotically normal (CLT dominates), even for moderately severe non-normality.
      • Standard inference (t-tests) is valid unless errors are extremely non-normal (e.g., heavy-tailed distributions).

97. Tests for normally distributed errors

We will now test for normal errors; remember that normal errors are crucial for small samples.

  1. Jarque-Bera

    <pronounced jark-k - bera >

Normal distribution has skewness 0 and kurtosis 3, the statistic

\[ \boxed{JB = \dfrac N 6 (s^2 + \dfrac 1 4 (k-3)^2)\sim^{H0} \chi^2_2} \]

The statistic follows a chi-square distribution with 2 degrees of freedom. If the probability is small, i.e., the statistic lands in the tail of the distribution, the null hypothesis is rejected and the errors are not normal.

To get a JB big enough to reject, we need large skewness and/or kurtosis far from 3.

  2. Shapiro-Wilk

We plot sample percentiles against normal percentiles. If they form a 45-degree line, the data are normal.

  3. Kolmogorov-Smirnov

Used to test against any distribution. We plot the CDF of the assumed distribution and the sample CDF, and reject if the distance between them is large.

Studies show that the Shapiro-Wilk test is the best and works with \(N<50\). The Jarque-Bera test's problem is that it is sensitive to outliers.
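All three tests are available in scipy; here is a quick sketch on simulated residuals (the data are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(0, 1, size=200)   # stand-in for OLS residuals

# Jarque-Bera: based on skewness and excess kurtosis, chi-square(2) under H0
jb_stat, jb_p = stats.jarque_bera(residuals)

# Shapiro-Wilk: the recommended test for small samples
sw_stat, sw_p = stats.shapiro(residuals)

# Kolmogorov-Smirnov against a normal cdf with fitted mean and std
ks_stat, ks_p = stats.kstest(residuals, "norm",
                             args=(residuals.mean(), residuals.std()))

# large p values mean we cannot reject normality
print(round(jb_p, 3), round(sw_p, 3), round(ks_p, 3))
```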

98. Interpreting regression coefficients in linear regression

Bivariate case: house prices with respect to square meters

\[ HP=\alpha + \beta ~SQM \]

we expect \(\alpha\) to be zero, \(\beta\) is the slope

If we have another variable

\[ HP = \alpha + \beta_1 ~BR + \beta_2~SQM \]

we pick a plane minimizing the sum of squares over the two variables (imagine a boy sitting with legs spread wide).

\(\beta_1\) is the marginal effect of adding one more bedroom.

why? if all other variables are constant

Example: let bedrooms increase by one; the equation becomes

\[ HP' = \alpha + \beta_1(1+BR)+ \beta_2 SQM \]

So the difference in house price is the two models subtracted from each other; \(1 \cdot \beta_1\) is the difference:

\[ \Delta HP' = \beta_1 \]

which is the partial effect; notice that it is also the partial derivative.

99. Interpreting regression coefficients in log models part 1

In the context of log models

\[ \ln y = \alpha + \beta_1 ~\ln x_1 + \beta_2 ~\ln x_2 \]

\(\beta_1\) tells us how much \(\ln y\) changes when \(\ln x_1\) increases by 1.

A better explanation is to differentiate

\[ \dfrac{dy}{y} = \beta_1 \dfrac{dx}{x}\\ \boxed{\beta_1 = \dfrac{x_1}{y}\dfrac{dy}{dx} = \dfrac{dy/y}{dx/x}} \]

which is \(\dfrac{\%\Delta y}{\%\Delta x}\) aka partial elasticity

in the log level model

\[ \ln y = \alpha + \beta x_1 \]

\(\beta\) is the proportional (×100: percentage) change in \(y\) for a one-unit increase in \(x\).

in level log model

\[ y = \alpha + \beta \ln x_1 \]

\(\beta\) is the change in \(y\) units with respect to a one-percent increase in \(x\) (more precisely, \(\beta/100\) per 1%).

100. Interpreting regression coefficients in log models part 2

It is helpful to remember these rules first

\[ a \ln b = \ln b^a\\ e^{a+b} = e^a\cdot e^b\\ e^{\ln x} = x \]

Back to the log model

\[ \ln y = \alpha + \beta_1 ~\ln x_1 + \beta_2 ~\ln x_2 \]

To get back y, we take exponential

\[ \begin{align*} y &= e^{\alpha + \beta_1 ~\ln x_1 + \beta_2 ~\ln x_2}\\ &= e^\alpha e^{\beta_1 \ln x_1} e^{\beta_2 \ln x_2}\\ &= e^\alpha e^{\ln x_1^{\beta_1}} e^{\ln x_2^{\beta_2}}\\ &= e^{\alpha} x_1^{\beta_1}x_2^{\beta_2} \end{align*} \]

Notice that \(y\) has a nonlinear relationship with the \(x\)'s, and we transformed it into a linear one by taking \(\ln\).

There is a multiplicative relation between \(x_1, x_2\), and the parameters enter nonlinearly.

101. The benefits of a log dependent variable

  1. Deals with non linear relationship (multiplicative)

\[ \ln y = \alpha + \beta_1 x_1+\beta_2 x_2\\ y = e^\alpha e^{\beta_1 x_1} e^{\beta_2 x_2} \]

  2. Has an economic interpretation (elasticity)

\[ \ln y = \alpha + \beta_1 \ln x_1+ \beta_2 \ln x_2\\ y = e^\alpha x_1^{\beta_1}\cdot x_2^{\beta_2} \]

  3. If the dependent variable is bounded (e.g., \(sales \ge 0\)), it is theoretically called a limited dependent variable and OLS fails; we use \(\ln y\) to fix this, since

\[ -\infty < \ln y < \infty \]

  4. Deals with heteroscedasticity (because errors shrink)
  5. Makes error terms more normal (deals with skewness)

102. Dummy variables - an introduction

\[ sex = \begin{cases}1 & female\\0 & male \end{cases} \]

If gender is female, we give it 1, 0 if male, to test for sex discrimination

\[ wage = \alpha +\dots + \beta_p ~gender + u \]

To interpret \(\beta_p\), evaluate the model when gender is 1 and when gender is 0:

\[ \bar{wage}_f = \alpha +\dots + \beta_p\\ \bar{wage}_M = \alpha +\dots +0 \]

The difference between them

\[ \bar{wage}_f - \bar{wage}_M = \beta_p \]

Which is the effect of being female on wage (expect it to be negative due to gender bias)

Another example is \(war\) affected by \(civil~war\)

103. Dummy variables - interaction terms explanation

\[ gender = \begin{cases}1 & female\\0 & male \end{cases} \]

What if we add interaction term

\[ wage = \alpha +\beta_1 educ + \beta_2 ~gender + \beta_3~gender \cdot educ \]

To interpret it, again evaluate the model for females (\(gender = 1\)) and for males (\(gender = 0\)):

\[ \bar{wage_f} = \alpha + \beta_1 educ + \beta_2 + \beta_3 educ\\ = (\alpha + \beta_2) + (\beta_1 + \beta_3)educ \]

as for males

\[ \bar{wage}_M = \alpha + \beta_1 educ \]

\(\beta_2\) represents the effect of being female rather than male with zero years of education (if \(educ = 0\), only \(\beta_2\) remains as the difference).

\(\beta_3\) is the effect of one more year of education for females; if it is \(>0\), then the effect of one more year of education is bigger for females than for males.

This allows for having different slopes

104. Continuous variables - interaction term interpretation

If we have this equation

\[ sales_t = \alpha + \beta_1 \, price + \beta_2 \, advert.+ \beta_3\, price \cdot advert. \]

we expect \(\beta_1 <0, \beta_2>0\). Let advertisement = 100

\[ \overline{S_{A=100}} = \alpha + \beta_1 price + 100 \beta_2 + 100\beta_3 price \]

if we aggregate the effect of price

\[ (\beta_1 + 100\beta_3)price \]

We expect \(\beta_3>0\) because advertising decreases consumers' sensitivity to price changes.

If price = 10:

\[ \overline{S_{p=10}} = \alpha + 10\beta_1 + \beta_2 advert.+ 10\beta_3 advert. \]

and aggregate effect of advertisement

\[ (\beta_2 + 10\beta_3)advert \]

\(\beta_3\) here means that when the price is higher, the effect of advertising is larger.

105. The F statistic - an introduction

The F statistic is used to test multiple hypotheses jointly:

\[ y = \alpha + \beta_1 x_1 + \beta_2x_2 +\dots + \beta_p x_p \]

we know that to test \(H_0: \beta_1 = 0\) we use \(t\)

but if we want to test

\[ H_0: \beta_1 = \beta_2 = \dots = \beta_p=0 \]

The \(H_1\) is that at least one \(\beta\) is nonzero:

\[ \boxed{H_0: \beta_1 = \beta_2 = \dots = \beta_p=0\qquad H_1: \beta_i \neq 0} \]

To do it: we have unrestricted regression, which is the original model

\[ y = \alpha + \beta_1 x_1 + \beta_2x_2 +\dots + \beta_p x_p \]

and calculate \(SSR\), the sum of squared residuals \(\sum \hat{u}^2\).

Then impose the restrictions to get the restricted model:

\[ y= \alpha \]

We expect the error of the restricted to be bigger than the error of the unrestricted

\[ SSR_R>SSR_{UR} \]

Even if the \(\beta\)'s are useless, \(SSR_R\) will still be bigger <remember the explanation of \(R^2\)>,

so instead we ask: is it significantly bigger than \(SSR_{UR}\)? We answer this with the \(F\) statistic:

\[ F = \dfrac{SSR_R-SSR_{UR}}{SSR_{UR}} \]

This is unitless, but to be able to tabulate \(F\) statistic, we divide by number of restrictions

\[ \boxed{F = \dfrac{(SSR_R-SSR_{UR})/p}{SSR_{UR}/(N-p-1)}} \]

where \(N\) is the number of observations and \(p\) is the number of regressors (restrictions here). This ratio follows the \(F\) distribution.

Note:

the \(1\) is not special: we subtract the number of remaining parameters in the denominator (here, just the constant). In other words, the denominator's degrees of freedom are

\(\boxed{N - \text{number of parameters in the unrestricted model}}\)

Intuition:

If the unrestricted model is really effective, the difference in errors will be big, making the \(F\) value large, so we reject \(H_0\) and accept that some \(\beta_i \neq 0\).

\(F\) here has two degrees of freedom:

\[ F\sim F_{p,\,N-p-1} \]

106. F test - example 1

Let's do it numerically: \(SSR_{UR} = 2000, SSR_R = 4000\).

Unrestricted:

\[ interest = \alpha + \beta_1 \, gov ~spending+ \beta_2 \,debt \]

Restricted:

\[ interest = \alpha \]

we have \(N= 200\) and \(H_0: \beta_1 = \beta_2 =0\)

Theory says they are not equal to 0.

\[ F = \dfrac{(4000-2000)/2}{2000/(200-2-1)} \]

Remember that the \(1\) is due to the constant.

\(F_{2,197} \approx 1000/10 \approx 100\), a big number, so we reject.

Note: if the maximum degrees of freedom in the table is 100, use that row.
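The arithmetic above, done exactly (scipy gives the precise critical value instead of a table lookup):

```python
from scipy import stats

# Numbers from this example: SSR_R = 4000, SSR_UR = 2000, N = 200,
# p = 2 restrictions, one remaining parameter (the constant).
ssr_r, ssr_ur, n, p = 4000, 2000, 200, 2
df2 = n - p - 1                                   # 197

f_stat = ((ssr_r - ssr_ur) / p) / (ssr_ur / df2)  # ~98.5
f_crit = stats.f.ppf(0.95, p, df2)                # ~3.04
print(round(f_stat, 1), f_stat > f_crit)
```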

107. F test - example 2

Unrestricted:

\(sat = \alpha + \beta_1 Parental ~sat + \beta_2 Class Size + \beta_3 siblings\)

Note that for \(\beta_3\) we don't know whether it is positive or negative: parents spend less time per child, but children also learn from their siblings.

If \(\beta_1\) is significant (say it has a really high \(t\)), we test the other two jointly:

\[ H_0: \beta_2 = \beta_3 = 0 \]

The restricted model

\[ sat = \alpha + \beta_1 Parental ~sat \]

\(SSR_{UR} = 100, SSR_R = 120, N=30\)

\[ \boxed{F = \dfrac{(120-100)/2}{100/(30-3-1)}} \]

So the degrees of freedom are 2 and 26; since the statistic is bigger than the critical value, we reject.

Danger:

It's \(30-3-1\), not \(30-2-1\), because of the extra parameter (\(\beta_1\)) kept in the unrestricted model.

108. F test - the similarity with the t test

if we have this model

\[ y = \alpha + \beta_1x_1 + \beta_2x_2 \]

and want to test for \(\beta_2\), we use

\[ t = \dfrac{\beta_2^*}{SE}\sim t_{N-3} \]

But there is another way, using restricted model

\[ y = \alpha + \beta_1x_1 \]

and get the F

\[ F = \dfrac{(SSR_R-SSR_{UR})/1}{SSR_{UR}/(N-3)} \sim F_{1,N-3} \]

Big notice

\[ \boxed{F_{1,N-3} = t^2_{N-3}} \]

They are literally identical, so either test will work.
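The identity can be checked numerically on the critical values; \(df = 27\) is an arbitrary choice:

```python
from scipy import stats

# F(1, df) = t(df)^2 holds for critical values too: a two-sided t test at
# 5% matches a one-restriction F test at 5%.
df = 27  # an assumed N - 3

t_crit = stats.t.ppf(0.975, df)     # two-sided 5% critical value
f_crit = stats.f.ppf(0.95, 1, df)   # F critical value with 1 restriction

print(round(t_crit ** 2, 4), round(f_crit, 4))  # the two numbers agree
```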

109. The F test - R squared form

\[ y = \alpha + \beta_1x_1 + \beta_2x_2 +\dots + \beta_px_p \]

And want to test \(\beta_2\) up to \(\beta_p\)

\[ H_0:\beta_2 = \dots = \beta_p = 0\\ H_1: \beta_i \neq 0, i \in [2,p] \]

To solve, we do the restricted model and compare

\[ y = \alpha + \beta_1x_1 \]

instead of comparing \(SSR_{UR}, SSR_{R}\), we can use \(R^2\)

Remember that \(R^2\) measures how well the line fits the data, and it increases with new variables even if they are bad. So we expect \(R^2_U\) to be bigger than \(R^2_R\), but is the difference significant?

\[ \boxed{F = \dfrac{R^2_U-R^2_R}{(1-R^2_U)}} \]

Notice that the denominator is just \(SSR_U\) in disguise cuz

\[ R^2 = \dfrac{ESS}{TSS}= 1 - \dfrac{SSR}{TSS}\\ \\ ~\\ \boxed{SSR = (1-R^2)\,TSS} \]

And to standardize,

\[ F = \dfrac{(R^2_U-R^2_R)/(p-1)}{(1-R^2_U)/(N-p-1)} \]

The numerator is divided by the number of restrictions (\(p-1\), since \(\beta_1\) is excluded), and the denominator by \(N\) minus the number of parameters including the intercept (the 1).

statistical packages by default test

\[ H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0 \]

so restricted model becomes

\[ y = \alpha \]

which has \(R^2 = 0\), because it explains none of the deviations away from the mean. So \(F\) becomes

\[ \boxed{F = \dfrac{R^2/p}{(1-R^2)/(N-p-1)}} \]

again, this is the one used in statistical packages
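A small helper for this package-default form (the numbers in the usage line are illustrative, not from the text):

```python
# The regression F statistic in R-squared form (H0: all slopes are zero).
def regression_f(r2, n, p):
    """F = (R^2 / p) / ((1 - R^2) / (N - p - 1))."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))

# e.g. R^2 = 0.5 with N = 103 observations and p = 2 regressors
print(round(regression_f(0.5, 103, 2), 1))   # 50.0
```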

110. Testing hypothesis about linear combinations of parameters part 1

\[ score = \alpha + \beta_1 \text{class attendance}+ \beta_2 \text{father education} + \beta_3 \text{mother education} + u \]

and we want to test

\[ H_0: \beta_2 = \beta_3\\ H_1: \beta_2 \neq \beta_3 \]

How to test this? Use a \(t\) statistic built from both estimates:

\[ \boxed{t = \dfrac{\hat \beta_2 - \hat \beta_3}{SE(\hat \beta_2 - \hat \beta_3)}} \]

As for the standard error, its calculated as

\[ SE(\hat \beta_2 - \hat \beta_3) = [se(\hat \beta_2)^2 + se(\hat \beta_3)^2 - 2 S_{23}]^{1/2} \]

where \(S_{23}\) is an estimator of the covariance between the two estimates, obtainable with some digging; then compare against \(t_{N-K}\), where \(K\) is the number of parameters (4 here).

But there is another way to avoid the covariance term

we can write the hypothesis as

\[ \boxed{H_0: \beta_2 - \beta_3 = 0} \]

111. Testing hypothesis about linear combinations of parameters part 2

Continuing with the last example,

\[ score = \alpha + \beta_1 \text{class attendance}+ \beta_2 \text{father education} + \beta_3 \text{mother education} + u \]

we modified the hypothesis by rewriting it

\[ H_0: \beta_2 = \beta_3 \qquad H_0: \beta_2 - \beta_3 = 0 \]

Instead of zero, we can allow the difference to be any number, say \(\delta\):

\[ \beta_2 - \beta_3 = \delta\\ \beta_2 = \delta + \beta_3 \]

and insert it in the original model

\[ score = \alpha + \beta_1 \text{class attendance}+ (\delta + \beta_3) \text{father education} + \beta_3 \text{mother education} + u \]

Rearrange it to get

\[ score = \alpha + \beta_1 CA + \beta_3(FE + ME)+ \delta FE + u \]

and do the \(t\) test on \(\delta\); if the hypothesis is true, \(\delta\) is zero.

We can apply this same trick to bigger linear combinations, like \(\beta_2 - \beta_3 - \beta_1 = \delta\).

112. Testing hypothesis about linear combinations of parameters part 3

Now if we add \(\beta_4\) parental income to the model

\[ score = \alpha + \beta_1 \text{class attendance}+ \beta_2 \text{father education} + \beta_3 \text{mother education} + \beta_4 \text{pinc}+ u \]

with the null hypothesis of:

\[ H_0: \beta_1 = 0, \beta_2 = 0, \beta_3 = 0 \]

This is the unrestricted model, the restricted model will be

\[ score = \alpha + \beta_4 pinc + u \]

and do \(F\) statistic

\[ \boxed{F = \dfrac{(SSR_R - SSR_U)/q}{SSR_U/(N-K)} \sim F_{q,N-K}} \]

where \(q\) is the number of restrictions (3) and \(K\) the number of parameters including the constant (5).

113. Testing hypothesis about linear combinations of parameters part 4

\[ score = \alpha + \beta_1 \text{class attendance}+ \beta_2 \text{father education} + \beta_3 \text{mother education} + \beta_4 \text{pinc}+ u \]

If we change the null hypothesis to be

\[ H_0: \beta_2=1, \ \beta_3 =1, \ \beta_1=0 \]

We test it by substituting in the restricted model

\[ score = \alpha +1 \,FE +1 \, ME + \beta_4 pinc + u \]

But if we just run this regression, the coefficients the software outputs will not be constrained to exactly \(1\).

Solution: create a new dependent variable by subtracting the constrained independent variables from the original dependent variable:

\[ \boxed{score -FE-ME = \alpha + \beta_4 pinc + u} \]

and do the \(F\) statistic again

\[ F = \dfrac{(SSR_R-SSR_U)/3 }{SSR_U/(N-5)} \]

which is the same as last time.

Example 2:

If we try to test these restrictions:

\[ H_0: \beta_2 = \beta_3, \beta_1 = \beta_4 \]

we have to substitute in the restricted model

\[ score = \alpha + \beta_1(CA+Pinc)+ \beta_2(ME+FE)+ u \]

which is a nested form of the unrestricted model

\[ F = \dfrac{(SSR_R-SSR_U)/2 }{SSR_U/(N-5)} \]

A non-nested case is two models with different variables, like location instead of parental income; in that case, compare them using \(\bar{R}^2\) and \(AIC\).

114. Confidence intervals

If we have an estimate \(\hat \beta\) of \(\beta\), instead of reporting a point estimate (one value), we can report a range of values via a lower bound and an upper bound:

\[ \boxed{LB = \hat \beta - \delta * SE(\hat \beta)} \]

If we are building a \(95\%\) confidence interval, \(\delta\) is the \(97.5\)th percentile of \(t_{N-K}\), where \(N\) is the number of observations and \(K\) is the number of independent variables including the constant (i.e., parameters).

Upper bound is

\[ \boxed{UB = \hat \beta + \delta * SE(\hat \beta)} \]

Why 97.5%? Because we leave 2.5% on each side, creating the 95%.
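The bounds computed in code, reusing \(\hat\beta = 52\) from the example below; the standard error, \(N\), and \(K\) here are assumed for illustration:

```python
from scipy import stats

# 95% confidence interval: delta is the 97.5th percentile of t(N - K).
# se, n, k are illustrative assumptions.
beta_hat, se, n, k = 52.0, 1.0, 100, 3

delta = stats.t.ppf(0.975, n - k)
lb, ub = beta_hat - delta * se, beta_hat + delta * se
print(round(lb, 2), round(ub, 2))   # roughly 50.02 53.98
```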

Here it is graphically:

one population → take many samples, in each sample calculate \(\hat \beta\) and \(CI\)

If we do this many times, then 95% of the time \(\beta \in [LB, UB]\).

It does not mean we are 95% confident (in a probability sense) that \(\beta\) lies in a given realized range, because we have no idea what \(\beta\) is anyway.

Extra

Notice the main difference: in hypothesis testing we assume \(\mu\) under the null, so we center the distribution around it.

In a confidence interval, we center the sampling distribution around \(\bar x\), because we have no \(\mu\).

Example:

\(\hat \beta = 52\), \(\beta_0 =50\)

NOTE:

for hypothesis testing, it is called the sampling distribution under \(H_0\), or the null distribution;

for confidence intervals, it is called the sampling distribution of the estimator, or the sampling distribution for short.

115. Goldfeld- Quandt test for heteroscedasticity

Remember that heteroscedasticity states that

\[ Var(u)= \sigma^2f(x_i) \]

How to know if we have heteroscedasticity?

Plot the squared errors against an independent variable in a scatter plot; if the spread gets wider or narrower as \(x\) changes, that indicates heteroscedasticity, otherwise homoscedasticity.

Remember also that we don't observe the population \(u\); we use the residuals \(\hat u\) instead, because they are asymptotically equivalent.

So we use \(\hat u ^2\) instead

Here is the test:

  1. plot \(\hat u ^2\) against \(X\)
  2. split the plot vertically around the middle
  3. get residuals in both regions
  4. compare using F statistic

\[ \boxed{F = \dfrac{\sum_{i \in 2} \hat u^2_i/N_2}{\sum_{i \in 1} \hat u^2_i/N_1} \sim F_{N_2,N_1}} \]

Note that this is a weak test, because we run it for only one \(X\) at a time; but it is a visual test, so it helps us understand why we have heteroscedasticity (e.g., outliers).
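A simplified sketch of the idea on simulated data (the full version refits the regression separately on each subsample; here one full-sample fit is used for brevity):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Simulated heteroscedastic data: the error spread grows with x, so the
# upper half of the sample should show much larger squared residuals.
n = 400
x = np.sort(rng.uniform(1, 10, n))
y = 2 + 3 * x + rng.normal(0, 1, n) * x   # Var(u) grows with x

# OLS fit on the full sample, then residuals
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# Split around the middle of x and compare mean squared residuals with F
lo, hi = resid[: n // 2], resid[n // 2:]
f_stat = (hi @ hi / hi.size) / (lo @ lo / lo.size)
f_crit = stats.f.ppf(0.95, hi.size, lo.size)
print(f_stat > f_crit)    # True: heteroscedasticity detected
```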

116. Breusch Pagan test for heteroscedasticity

This is more general than Goldfeld-Quandt test

\[ y = \alpha + \beta_1x_1 + \dots + \beta_px_p+u \]

and knowing that \(var(u|x)= \sigma^2f(x)\) under heteroscedasticity,

we want to test if error term depends on any variable or linear combination of the variables

Here is the test

  1. get \(\hat u ^2\)
  2. regress \(\hat u ^2\) on the \(x's\) aka auxiliary regression

\[ \hat u ^2 = \delta_0 + \delta_1x_1 + \delta_2x_2 + \dots + \delta_px_p \]

  3. if a \(\delta\) is significant, its variable is the cause; but if we don't care about identifying the specific variable,
  4. do an F statistic to see if any variable is significant

\[ H_0: \delta_1 = \delta_2 = \dots =\delta_p=0 \]

\[ F = \dfrac{R^2/p}{(1-R^2)/(N-p-1)}\sim F_{p,N-p-1} \]

  5. another way is to form the \(LM\) statistic

\[ \boxed{LM = NR^2 \sim \chi^2_p} \]

If \(R^2\) is high, the \(X\)'s explain the variance of \(\hat u^2\), so we have heteroscedasticity.
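The steps above can be sketched as follows, on simulated heteroscedastic data (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Breusch-Pagan LM sketch: regress squared residuals on the regressors,
# then LM = N * R^2 ~ chi-square(p) under H0 (homoscedasticity).
def lm_breusch_pagan(resid, X):
    """X must include a constant column; returns the LM statistic."""
    u2 = resid ** 2
    delta, *_ = np.linalg.lstsq(X, u2, rcond=None)
    fitted = X @ delta
    r2 = 1 - ((u2 - fitted) ** 2).sum() / ((u2 - u2.mean()) ** 2).sum()
    return len(resid) * r2

# Heteroscedastic toy data: the error spread grows with x, so LM should
# exceed the chi-square(1) 5% critical value of about 3.84
n = 500
x = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), x])
resid = rng.normal(0, 1, n) * (1 + 5 * x)
print(lm_breusch_pagan(resid, X) > 3.84)   # True: reject homoscedasticity
```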

117. White test for heteroscedasticity

We still did not allow for nonlinear terms and interactions between the \(X\)'s; the White test is even more general than the Breusch-Pagan test.

We need the White test when the residuals show a nonlinear relation, like a parabola.

Here is the test

  1. get \(\hat u^2\)
  2. do the auxiliary regression including products of variables and variables squared too

\[ \hat u ^2 = \delta_0+ \delta_1x_1 + \dots + \delta_px_p + \delta_{p+1}x_1^2 +\dots + \delta_{2p}x_p^2 + \delta_{2p+1} x_1x_2 + \dots \]

  3. Do the F test or LM test

Note:

as you add variables, you lose degrees of freedom and the test becomes less powerful.

So a better way for the test:

  1. get \(\hat u^2\)
  2. do the auxiliary regression on \(\hat y\)

\[ \hat u ^2 = \delta_0 + \delta_1 \hat y + \delta_2 \hat y^2 \]

  3. Do the F test or LM test

118. Serial correlation testing - introduction

Serial correlated errors means the error terms are correlated

\[ y_t = \alpha + \beta x_t + \varepsilon _t \]

and errors are correlated like \(AR(1)\)

\[ \varepsilon_t = \rho \varepsilon_{t-1}+u_t \]

where \(\varepsilon\) (pronounced epsilon or error term) follows an AR(1) process.

We don’t have population \(\epsilon\), so we estimate it

We can regress \(\hat\varepsilon_t\) on \(\hat\varepsilon_{t-1}\) and check whether the coefficient is significant using a \(t\) statistic:

\[ \hat \varepsilon_t =\delta_0+ \delta_1 \hat \varepsilon_{t-1} \]

Of course the null hypothesis is

\[ H_0: \rho = 0 \]

This test has a problem though

  1. in case of endogeneity (a relation between \(x_t\) and \(\varepsilon_t\)), regressing the error on the previous error is invalid, because we did not account for the endogeneity
  2. the \(t\) statistic will be calculated by statistical packages assuming homoscedasticity, which will not be the case here

\[ t = \dfrac{\delta_1^*}{se(\delta_1)} \]

119. Serial correlation - Durbin Watson test

To solve the problems facing the last approach, we use the Durbin-Watson test of \(H_0: \rho = 0\).

Here is the population error and its estimate

\[ \varepsilon_t = \rho \varepsilon_{t-1} + u_t\\ \hat{\varepsilon}_t= \delta_0 + \delta_1 \hat{\varepsilon}_{t-1} \]

To evade these problems, we use the Durbin-Watson test.

The statistic is

\[ \boxed{DW = \dfrac{\sum_{t=2} (\hat{\varepsilon}_t-\hat{\varepsilon}_{t-1})^2}{\sum_{t=1} \hat{\varepsilon}^2_t}} \]

The OLS estimator of \(\delta_1\) is very similar to Durbin-Watson:

\[ \boxed{\hat\delta_1 = \dfrac{\sum_{t=2} \hat{\varepsilon}_t\,\hat{\varepsilon}_{t-1}}{\sum_{t=2} \hat{\varepsilon}^2_{t-1}}} \]

so there is an approximate relationship between them; with \(\delta_1\) from OLS,

\[ \boxed{DW \approx 2(1-\delta_1)} \]

But notice that the hypothesis is different (it is one-tailed):

\[ H_0: \rho = 0\\ H_1: \rho >0 \]

It has two problems:

  1. it does not produce \(p\) values, so we have an inconclusive region (can't decide):

| Value | Decision |
| --- | --- |
| \(DW > d_u\) | do not reject null |
| \(d_l < DW < d_u\) | inconclusive |
| \(DW < d_l\) | reject null |

  2. it is not robust to the inclusion of endogenous regressors, so it fails when we have a lagged dependent variable

Why use Durbin Watson?

  1. historical reasons, and
  2. the \(t\) test relies on asymptotic theory, while Durbin-Watson is exact
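The statistic and its \(DW \approx 2(1-\rho)\) approximation can be checked on simulated AR(1) residuals (an illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(4)

# Durbin-Watson statistic from residuals; values near 2 suggest no
# first-order serial correlation, values near 0 strong positive rho.
def durbin_watson(resid):
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Simulated AR(1) residuals with rho = 0.8
n, rho = 2000, 0.8
eps = np.empty(n)
eps[0] = rng.normal()
for t in range(1, n):
    eps[t] = rho * eps[t - 1] + rng.normal()

print(round(durbin_watson(eps), 2))   # near 2 * (1 - 0.8) = 0.4
```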

120. Serial correlation testing - Breusch Godfrey test

This is a robust test

When a regression includes a lagged dependent variable (e.g., \(y_{t-1}\)), the error term \(\varepsilon_t\) may correlate with the regressors (the \(X\)'s), violating the zero conditional mean assumption. Traditional autocorrelation tests (e.g., Durbin-Watson) then fail due to endogeneity.

  1. Example:

\[ y_t = \alpha + \beta_1 x_{1t} + \beta_2 x_{2t} + \varepsilon_t \]

where

\[ \varepsilon_t = \rho\varepsilon_{t-1}+ u_t \]

and we estimate the error term

\[ \hat{\varepsilon}_t= \delta_0 + \delta_1 \hat{\varepsilon}_{t-1} \]

Now the main problem is that \(\varepsilon_t\) and the \(x\)'s are related, so the zero conditional mean assumption fails

Solution: auxiliary regression

regress the residuals on the \(X\)'s and the lagged residuals

\[ \hat{\varepsilon}_t= \delta_0 +\gamma_1 x_{1t}+\gamma_2x_{2t}+ \delta_1 \hat{\varepsilon}_{t-1} \]

  2. In case we have a lagged dependent variable, like here

\[ \boxed{y_t = \alpha + \beta_1 x_{1t} + \beta_2 y_{t-1} + \varepsilon_t} \]

add it to the auxiliary regression

\[ \hat{\varepsilon}_t= \delta_0 +\gamma_1 x_{1t}+\gamma_2 y_{t-1}+ \delta_1 \hat{\varepsilon}_{t-1} \]

  3. So far we assumed the errors follow an \(AR(1)\) process. If they follow a higher-order process like \(AR(2)\)

\[ \boxed{\varepsilon_t = \rho_1\varepsilon_{t-1}+ \rho_2 \varepsilon_{t-2} + u_t} \]

add the extra lagged residual to the auxiliary regression and test the lags jointly

\[ \hat{\varepsilon}_t= \delta_0 +\gamma_1 x_{1t}+\gamma_2 y_{t-1}+ \delta_1 \hat{\varepsilon}_{t-1} + \delta_2 \hat{\varepsilon}_{t-2} \]

then finally check the statistic \(LM\)

\[ \boxed{LM = (N-q)R^2 \sim \chi^2_q} \]

N: number of observations, q: order of serial correlation tested; \(R^2\) comes from the auxiliary regression

Summary:

we add the explanatory variables, lagged residuals, and lagged dependent variables to the auxiliary regression to isolate their effects and test for autocorrelation
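A sketch of the whole procedure on simulated data with a lagged dependent variable (the model coefficients and \(\rho=0.5\) are illustrative assumptions; 3.84 is the familiar 5% critical value of \(\chi^2_1\)):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 400
x = rng.normal(size=T)

# AR(1) errors (rho = 0.5 assumed)
u = rng.normal(size=T)
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = 0.5 * eps[t - 1] + u[t]

# Model with a lagged dependent variable
y = np.zeros(T)
for t in range(1, T):
    y[t] = 1.0 + 2.0 * x[t] + 0.3 * y[t - 1] + eps[t]

# Original regression: y_t on x_t and y_{t-1}; keep residuals
X = np.column_stack([np.ones(T - 1), x[1:], y[:-1]])
b = np.linalg.lstsq(X, y[1:], rcond=None)[0]
e = y[1:] - X @ b

# Auxiliary regression: e_t on the same regressors AND e_{t-1}
Z = np.column_stack([X[1:], e[:-1]])
g = np.linalg.lstsq(Z, e[1:], rcond=None)[0]
v = e[1:] - Z @ g
r2 = 1 - (v @ v) / np.sum((e[1:] - e[1:].mean()) ** 2)

n, q = len(e) - 1, 1
lm = (n - q) * r2          # LM = (N - q) R^2, compared to chi2_q
print(lm, lm > 3.84)       # 3.84 = chi2(1) critical value at 5%
```

Even though the lagged dependent variable absorbs part of the autocorrelation, the LM statistic still rejects \(H_0: \rho = 0\) decisively here.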

121. Ramsey RESET test for functional misspecification

If a plot of the data \((y_i, x_i)\) shows a parabola, we can guess that the model actually contains a squared term

\[ y_i = \alpha + \beta_1 x_i + \beta_2x_i^2 + \varepsilon_i \]

That is easy, but it gets hard when we have many independent variables

\[ y_i = \alpha + \beta_1 x_1 + \beta_2x_2 +\dots + \beta_p x_p+ \varepsilon_i \]

we would have to add every independent variable squared, then all their cross products (interactions) too, which costs many degrees of freedom

We need a way to detect the correct functional form without losing all those degrees of freedom

Solution:

  1. run the original linear regression

    \[ y_i = \alpha + \beta_1 x_1 + \beta_2x_2 +\dots + \beta_p x_p+ \varepsilon_i \]

  2. run the regression again, but with \(\hat{y}^2\), \(\hat{y}^3\), and so on added (we can include \(\hat{y}^2\) only)

\[ y_i = \alpha + \beta_1 x_1 + \beta_2x_2 +\dots + \beta_p x_p+ \gamma_1 \hat{y}_i^2 + \gamma_2 \hat{y}_i^3 + \dots + \varepsilon_i \]

  3. run an \(F\) test (not \(t\)) on the added \(\gamma\)'s jointly, because we want to know whether there is any functional misspecification
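The steps above can be sketched on simulated data where the true relation is quadratic (all coefficients are illustrative assumptions; 3.0 is approximately the 5% critical value of \(F(2,\infty)\)):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x = rng.uniform(0, 4, size=n)
y = 1.0 + 0.5 * x + 0.8 * x ** 2 + rng.normal(size=n)  # true model is quadratic

# Step 1: misspecified linear fit
X1 = np.column_stack([np.ones(n), x])
b1 = np.linalg.lstsq(X1, y, rcond=None)[0]
e1 = y - X1 @ b1
yhat = X1 @ b1

# Step 2: refit with powers of the fitted values added
X2 = np.column_stack([X1, yhat ** 2, yhat ** 3])
b2 = np.linalg.lstsq(X2, y, rcond=None)[0]
e2 = y - X2 @ b2

# Step 3: joint F test on the two added terms
rss1, rss2 = e1 @ e1, e2 @ e2
m = 2                                         # number of restrictions
F = ((rss1 - rss2) / m) / (rss2 / (n - X2.shape[1]))
print(F, F > 3.0)   # 3.0 ~ F(2, large) critical value at 5%
```

The quadratic signal makes \(F\) enormous, so RESET flags the missing \(x^2\) without us ever adding squares and interactions of every regressor.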

122. Gauss Markov violations: summary of issues

| Assumption | Problems | Causes | Tests | Solutions |
| --- | --- | --- | --- | --- |
| No perfect collinearity | Regression fails (singular matrix); cannot separate variable effects | Exact linear relationship (e.g., \(x_1=\delta_0+\delta_1x_2\)) | — | Drop collinear variable(s) |
| Homoscedastic errors | OLS no longer BLUE; standard errors invalid | Omitted variables, functional misspecification, true heteroscedasticity | 1. Goldfeld-Quandt 2. Breusch-Pagan 3. White test | White robust standard errors; WLS/FGLS |
| No serial correlation | OLS no longer BLUE; standard errors invalid | Autocorrelation, omitted variable bias | 1. Durbin-Watson 2. Lagrange multiplier (Breusch-Godfrey) | Newey-West standard errors; add the omitted variable |
| Zero conditional mean (endogeneity) | OLS is biased | Omitted variables, measurement error | No simple tests | Instrumental variables (IV) |

123. Heteroscedasticity: as a symptom of omitted variable bias part 1

There are many causes of heteroscedasticity, each one needs a different treatment.

For example: if in the population, the variance of the error depends on the observation

\[ u_i \sim (0, \sigma_i^2) \]

we take a sample, and estimate the error \(\hat u_i\)

There are two ways to get heteroscedasticity:

  1. true heteroscedasticity in the population error term

\[ \%~of~income~spent~on~food= \alpha_1+\beta_1income + u_i \]

Income and the % of income spent on food are negatively correlated, but the error variance increases as income increases. A variable that explains this spread is taste for food

This example may seem like omitted variable bias, but it's not: the missing variable \(taste\) is not correlated with income, so omitting it does not bias \(\beta_1\); it only shows up in the error variance

  2. omitted variable bias / model heteroscedasticity

There is a non-linear relation between \(y\) and \(x_1\), so we need to add \(x_1^2\) to explain the error

\[ y = \alpha + \beta_1x_1+\beta_2x_1^2+u \]

Not adding \(x_1^2\) counts as an omitted variable because \(x_1^2\) is correlated with \(x_1\)

Tip:

always check for model heteroscedasticity first
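A small simulation of this point: when the true relation is quadratic but we fit a line, the squared residuals become predictable from \(x\). Below is a Breusch-Pagan-style LM check (the data-generating process is an illustrative assumption; 5.99 is the familiar 5% critical value of \(\chi^2_2\)):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x = rng.uniform(0, 4, size=n)
y = 1.0 + 0.5 * x + 0.8 * x ** 2 + rng.normal(size=n)  # true relation is quadratic

def lm_het(e, x):
    # regress squared residuals on x and x^2; LM = n * R^2 ~ chi2(2)
    Z = np.column_stack([np.ones(len(e)), x, x ** 2])
    g = np.linalg.lstsq(Z, e ** 2, rcond=None)[0]
    v = e ** 2 - Z @ g
    r2 = 1 - (v @ v) / np.sum((e ** 2 - (e ** 2).mean()) ** 2)
    return len(e) * r2

# Misspecified linear model: residuals pick up the omitted x^2 term
X1 = np.column_stack([np.ones(n), x])
e_lin = y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]

# Correct quadratic model
X2 = np.column_stack([np.ones(n), x, x ** 2])
e_quad = y - X2 @ np.linalg.lstsq(X2, y, rcond=None)[0]

print(lm_het(e_lin, x), lm_het(e_quad, x))  # compare each to 5.99 (chi2_2, 5%)
```

The linear fit shows a huge LM statistic, while adding \(x^2\) makes the "heteroscedasticity" disappear: the symptom was really model misspecification.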

124. Heteroscedasticity: as a symptom of omitted variable bias part 2

\[ test~score = \alpha + \beta_1 school~funding + u \]

If we plot this, we get a positive relationship, but the error variance is not constant (it increases with funding).

We have an omitted variable: \(school~size\). Smaller schools tend to have higher test scores but also less funding, so size is correlated with both.

\[ test~score = \alpha + \beta_1 school~funding+ \beta_2 size + u \]

125. Serial correlation: a symptom of omitted variable bias

If we work for an ice cream retailer and have ice cream sales over time, along with data on price and advertising

\[ sales_t = \alpha + \beta_1 price_t+\beta_2 advert_t + u_t \]

Even if the line fits well, we have serial correlation (try the Durbin-Watson test) because we have an AR(1) error

\[ u_t = \rho u_{t-1}+\varepsilon_t \]

but this serial correlation is due to an omitted variable, \(temperature\); it's correlated with advertising and price, so it is an important variable

\[ sales_t = \alpha + \beta_1 price_t+\beta_2 advert_t+\beta_3 temperature_t + u_t \]
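A small simulation of this story (the sinusoidal "temperature" cycle and all coefficients are illustrative assumptions): omitting temperature drives \(DW\) far below 2, and including it brings \(DW\) back to about 2.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 200
t = np.arange(T)
temp = 20 + 10 * np.sin(2 * np.pi * t / 52)               # weekly temperature cycle (assumed)
price = 5 + rng.normal(scale=0.5, size=T)
advert = 2 + 0.05 * temp + rng.normal(size=T)             # advertising tracks temperature
sales = 10 - 1.0 * price + 0.8 * advert + 1.0 * temp + rng.normal(size=T)

def dw(e):
    # Durbin-Watson statistic
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def resid(cols):
    # OLS residuals of sales on an intercept plus the given columns
    X = np.column_stack([np.ones(T)] + cols)
    return sales - X @ np.linalg.lstsq(X, sales, rcond=None)[0]

dw_without = dw(resid([price, advert]))       # temperature omitted
dw_with = dw(resid([price, advert, temp]))    # temperature included
print(dw_without, dw_with)
```

Without temperature the residuals inherit its smooth seasonal swing, so \(DW\) collapses toward 0; with it, the residuals look white and \(DW\) hovers near 2.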

126. Heteroscedasticity: dealing with problems caused

WHEN WE CAN’T FIND THE OMITTED VARIABLE

  1. we have biased standard errors (typically too small), so all the inference is incorrect
  2. OLS is no longer BLUE; there is a more efficient estimator than OLS, which comes from GLS (WLS specifically)

So the solutions that we can take

  1. use White standard errors or Newey-West standard errors, which are robust and fix the inference. But this does not solve the heteroscedasticity itself, only its effect
  2. solve the heteroscedasticity using WLS (the recommended way). The catch is that we are expected to know the variance of \(u\) in the population, e.g.

\[ Var(u|X)=\sigma^2x^2 \]

which we do not actually know, so we approximate it; this turns \(GLS\) into \(FGLS\), feasible generalized least squares
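A minimal WLS sketch under the assumed form \(Var(u|X)=\sigma^2x^2\): dividing every variable by \(x\) makes the transformed error homoscedastic (the data-generating process is illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
x = rng.uniform(1, 5, size=n)
u = x * rng.normal(size=n)           # Var(u|x) proportional to x^2 (assumed)
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])

# Plain OLS: unbiased but not efficient here
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# WLS: weight every row by 1/x, so u/x has constant variance
w = 1.0 / x
b_wls = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0]

print(b_ols, b_wls)   # both near (1, 2)
```

Both estimators recover the true coefficients; the gain from WLS is efficiency (smaller sampling variance), which is exactly what GLS promises. In practice the weight function must itself be estimated, which is the FGLS step.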

127. Problem set 3

We focus now on hypothesis testing and model selection.

Practical problem: analyze the 2012 US presidential election

check the book The Signal and the Noise

Theoretical problem: check log dependent variable