Gauss-Markov assumptions

31. Gauss-Markov assumptions 1

If the Gauss-Markov assumptions are met, least squares is BLUE: the Best Linear Unbiased Estimator.

The assumptions are:

  1. population process is linear in parameters

    linearity applies to parameters like \(\alpha, \beta\); the variables can be nonlinear, like \(\text{educ}^2\)

  2. x and y are a random sample

    each individual is equally likely to be picked, and all individuals come from the same population

  3. zero conditional mean of error

    \(E(u_i|x) = 0\)

    If I know a person's level of education, I can't predict whether their wage is above or below the average wage of people with the same years of education

32. Gauss-Markov assumptions 2

The other assumptions:

  1. No perfect collinearity in regressors

    \(x_1\) can’t be a linear function of \(x_2\) like \(x_1 = \delta_0 + \delta_1x_2\)

    can't use meters and kilometers in the same regression

  2. Homoscedastic errors

    width of scatter plot is constant

    \[ Var(u_i|X_i) = \sigma^2 \]

  3. No serial correlation

    \[ cov(u_i,u_j) = 0 \]

    knowing one error will not help me know the next error

33. Zero conditional mean of errors

What do we mean by zero conditional mean assumption?

\[ \boxed{E[u_i|x_i] =0} \]

If this assumption is not met, the estimator is biased

\[ E[\hat \beta | x_i] \neq \beta^p \]

Meaning the sampling distribution of \(\hat \beta\) will not be centered on the population \(\beta\)

We can write this assumption in another way

\[ \boxed{cov(u_i,x_i)=0} \]

aka no relationship between the independent variable and error term
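To see what the assumption buys us, here is a minimal pure-Python simulation (the numbers and the `sample_cov` helper are illustrative, not from the notes): when \(u\) is generated independently of \(x\), the sample covariance is near zero and OLS recovers the true slope.

```python
import random

random.seed(0)
n = 100_000

def sample_cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

x = [random.gauss(0, 1) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]         # drawn independently of x
y = [2.0 + 3.0 * xi + ui for xi, ui in zip(x, u)]  # true beta = 3

c_ux = sample_cov(u, x)                         # near 0: the assumption holds
beta_hat = sample_cov(y, x) / sample_cov(x, x)  # near the true 3
```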

34. Omitted variable bias example 1

omitted variable bias will ruin the zero conditional mean assumption \(E[u_i|x_i]\neq 0\)

Example: how does classroom size affect test scores? We think the relation is negative, aka \(\beta\) < 0 (a bigger class means the teacher focuses less on each student)

\[ TS = \alpha + \beta \, CS + u \]

we use OLS to estimate the \(\beta\), but it will be biased

why?

many omitted variables in \(u\) term are correlated with class size and test score, like funding

more funding \(\to\) less class size, negative relationship

more funding \(\to\) better books = better test score, positive effect

so the assumption is not met \(cov(u, CS) \neq 0\)

if \(\tilde \beta = -10\) while the real \(\beta = -5\), class size is taking credit away from funding, so this is overestimation (in magnitude)

Why overestimation not underestimation?

the bias is the product of the omitted variable's effect on \(y\) (funding improves test scores, \(+\)) and its correlation with \(x\) (funding is negatively related to class size, \(-\)), so the bias is negative \((+\cdot - = -)\)

since beta is already negative, a negative bias pushes the estimate further below zero

\[ \hat \beta_1 = \beta_1 + \text{bias} \qquad -10 = -5 + (-5) \]

35. Omitted variable bias example 2

\[ wages = \alpha + \beta_1 \, educ + \beta_2 \, ability + u \]

wages is a function of education and ability, but we don’t have ability data so we use education only

\[ wages = \alpha + \beta_1 \, educ + v \]

our error became bigger cuz it includes the omitted ability

It is known that people with more ability take more education, so there is a positive correlation

so zero conditional mean error condition is not satisfied, estimator is biased

\[ cov(v, educ) \neq 0 \]

\(\beta_1\) will take extra credit from \(\beta_2\), so its over estimated aka upwardly biased \(+\cdot + = +\)
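A hedged sketch of this wage example in pure Python (the coefficients 2 and 1 and the data-generating choices are made up for illustration): regressing wages on education alone picks up part of ability's effect, so the slope comes out above the true return to education.

```python
import random

random.seed(1)
n = 200_000

def slope(y, x):
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sum((xi - mx) ** 2 for xi in x)
    return num / den

abil = [random.gauss(0, 1) for _ in range(n)]
educ = [a + random.gauss(0, 1) for a in abil]   # ability raises education
u    = [random.gauss(0, 1) for _ in range(n)]
wage = [1.0 + 2.0 * e + 1.0 * a + ui
        for e, a, ui in zip(educ, abil, u)]     # true return to educ = 2

b_short = slope(wage, educ)   # omits ability: biased upward, about 2.5 here
```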

36. Omitted variable bias example 3

Suppose we look at how IQ scores are affected by being African or not, aka a dummy variable

\[ Africa = \begin{cases} 1, &African \\ 0, &Otherwise \end{cases} \]

we get the regression equation

\[ IQ = \alpha + \beta \, \text{Africa} +u \]

an omitted variable is level of education, which is negatively correlated with Africa. beta is underestimated \(+ \cdot - = -\)

\(\beta^p = 0, \tilde \beta = -10\)


Here is a rule:

  • If the product of the two relationships (correlation and effect) is positive:
    • The bias is positive (upward bias): \(E[\hat\beta] > \beta\).
  • If the product of the two relationships is negative:
    • The bias is negative (downward bias): \(E[\hat\beta] < \beta\).
  • Whether this counts as "over" or "under" estimation in magnitude depends on the sign of \(\beta\): a negative bias on a negative \(\beta\) (the class size example) pushes the estimate further from zero.

37. Omitted variable bias proof part 1

wage is affected by education. we estimate the effect from a sample, and omit a variable like ability

ability and education: positive relation,

ability and wage: positive relation

\[ wages = \alpha + \beta_1 \, educ + \beta_2 \, ability + v_i \]

and the omitted ability leads to

\[ wages = \alpha + \beta_1 \, educ +u \]

remember \(\hat \beta\) formula?

\[ \begin{align*} \hat \beta &= \dfrac{cov(y,x)}{var(x)} \\&= \dfrac{cov(\beta x+u,x)}{var(x)} \\&= \dfrac{\beta cov(x,x)}{var(x)} + \dfrac{cov(u,x)}{var(x)} \\&= \beta + \dfrac{cov(u,x)}{var(x)}\\ &= \beta + \dfrac{\sum(educ_i-\bar{ \text{educ}})u_i}{\sum(educ_i-\bar{ \text{educ}})^2} \end{align*} \]

38. Omitted variable bias proof part 2

Continuing the last function, we substitute \(u_i\) with \(\beta_2 \text{ ability} + v_i\) then we take the expectation

\[ \begin{align*} \hat \beta_1 &= \beta^P_1 + \dfrac{\sum(educ_i-\bar{ \text{educ}})u_i}{\sum(educ_i-\bar{ \text{educ}})^2}\\ &= \beta_1^P + \dfrac{\sum(educ_i-\bar{ \text{educ}})(\beta_2 \, abil_i + v_i)}{S^2_{educ}} \end{align*} \]

Taking expectation

\[ E[\hat \beta_1] = \beta_1^P + \dfrac{ \beta_2\sum E[(educ_i-\bar{ \text{educ}})abil]}{S^2_{educ}} \]

the \(v_i\) term drops out cuz we assume it is orthogonal to education, so \(E[(educ_i-\bar{\text{educ}})v_i]=0\)

since the correlation between education and ability is positive, and \(\beta_2\) is also positive, the bias term is positive and we can write

\[ \boxed{E[\hat \beta_1] = \beta_1^P + \beta_2 \, \dfrac{\sum E[(educ_i-\bar{ \text{educ}})abil]}{S^2_{educ}}} \quad \text{with the second term} > 0 \]

estimator is upward biased

\[ E[\hat \beta_1]> \beta_1^P \]
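The omitted-variable bias formula \(E[\hat\beta_1] = \beta_1^P + \beta_2 \cdot \widehat{cov}(educ,abil)/S^2_{educ}\) can be checked numerically — a sketch under assumed coefficients (all numbers illustrative): the short-regression slope matches the formula up to sampling noise.

```python
import random

random.seed(2)
n = 200_000

def cov(a, b):
    m = len(a)
    ma, mb = sum(a) / m, sum(b) / m
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / m

abil = [random.gauss(0, 1) for _ in range(n)]
educ = [a + random.gauss(0, 1) for a in abil]
wage = [0.5 + 2.0 * e + 1.5 * a + random.gauss(0, 1)
        for e, a in zip(educ, abil)]           # beta1 = 2, beta2 = 1.5

b_short   = cov(wage, educ) / cov(educ, educ)  # regression omitting ability
predicted = 2.0 + 1.5 * cov(abil, educ) / cov(educ, educ)  # beta1 + beta2*delta
```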

39. Reverse causality part 1

If zero conditional mean of errors is not met, we have endogeneity aka x is endogenous <there is a relation between x and \(u\)>, beta is biased.

\[ E[u_i|x_i] \neq 0 \]

The econometrician Eli Berman investigated Iraq war

He was interested on level of violence with respect to development,

\[ violence = \alpha + \beta \, development + u_i \]

the relation appears to be positive, which is wrong:

high violence get more development spending.

\[ development = \alpha + \beta \, violence + u_i \]

so if the first model gives \(\hat \beta = 5\%\), the population value can be \(\beta^p = -5\%\), aka we have reverse causality

Solution: use instrumental variable

40. Reverse causality part 2

why does reverse causality lead to \(E[u_i|x_i] \neq 0\)?

The Human Development Index (HDI) is a composite statistic used to rank countries based on their overall achievements in health, education, and income.

we expect civil war to decrease HDI, but having low HDI will cause civil war too

\[ \begin{align*} HDI &= \alpha + \beta CW + u\\ CW &= \delta + \gamma HDI + v \end{align*} \]

to see the problem, state the conditional expectation as covariance of \(u_i,CW\) where \(CW\) is a dummy variable

\[ \begin{align*} cov(u,cw)&= cov(u, \delta+\gamma HDI+v) \\&= \gamma cov(u, HDI)\\&= \gamma cov(u,\alpha + \beta CW+ u)\\&= \dots + \gamma var(u) \\&\neq 0 \end{align*} \]

41. Measurement error in independent variable part 1

Endogeneity means \(E[u_i|x_i] \neq 0\)

we can get endogeneity by measurement error

Example:

over time, we measure company advertising and sales

There is reverse causality but ignore it for now, and true \(\beta\) is 10

\[ sales_t = \alpha + \beta A_t + v \]

But what if we measure level of advertising wrong?

we get the measure \(M = \text{advertising} + \text{error}\); \(\hat\beta\) is attenuated and outputs, say, 7. If the error increases further, \(\hat\beta\) can approach zero

As the measurement error increases, the bias increases

42. Measurement error in independent variable part 2

why does measurement error violate the zero conditional mean assumption?

back to sales function of advertising, we measure advertising as \(M\) = A + error cuz we don’t have exact data of advertising

substitute M in the equation to take covariance

\[ \begin{align*} S_t &= \alpha + \beta A_t + u_t\\ M_t &= A_t + v_t\\ S_t &= \alpha + \beta(M_t - v_t)+u_t\\ S_t &= \alpha + \beta M_t + (u_t-\beta v_t)\\ cov(u_t-\beta v_t,M_t) &= - \beta \, cov(v_t,A_t+v_t)\\ &= - \beta \, cov(v_t,v_t)\\ &= -\beta \sigma^2_v\\ &\neq 0 \end{align*} \]
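A small simulation of the attenuation (the true \(\beta = 10\) and the noise levels are illustrative): as the measurement-error variance grows, the estimated slope shrinks toward zero, following \(\beta \cdot var(A)/(var(A)+\sigma^2_v)\).

```python
import random

random.seed(3)
n = 200_000

def slope(y, x):
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

A = [random.gauss(0, 1) for _ in range(n)]
S = [1.0 + 10.0 * a + random.gauss(0, 1) for a in A]   # true beta = 10

# plim beta_hat = 10 * var(A) / (var(A) + var(v)): about 10, 8, 5 here
atten = [slope(S, [a + random.gauss(0, sd) for a in A])
         for sd in (0.0, 0.5, 1.0)]
```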

43. Functional misspecification 1

people start with low salaries, earn the most in their forties and fifties, then wages decline, so there is an inverted \(U\) shape. Maybe the relation is

\[ \text{wages} = \alpha + \beta_1 \text{age} + \beta_2 \text{age}^2 + u \]

\[ \beta_1 > 0 \qquad \beta_2 <0 \]

if we plot it, it will be an inverted \(U\) shape, so fitting a straight line is a bad idea. A straight line means we did not add \(\text{age}^2\); to make the line curvy, we add it

Note: functional misspecification is a form of omitted variable bias

Note2: there is no perfect collinearity, cuz \(\text{age}\) and \(\text{age}^2\) don't have a linear relationship

44. Functional misspecification 2

How car sales are affected by price?

\[ CS = AP^\beta e^u \]

beta here is less than zero, indicating that

\[ p \to \infty \quad cs \to 0 \]

This indicates that when the price is low, a small change makes a big change in demand; when the price is high, a small change has almost no effect

If we have the dataset thrown in a scatterplot, and we fit a straight line, we will be biased due to functional misspecification

Notice that the CS function is not linear in parameters, so to estimate it, we take logs

\[ \ln CS = \ln A + \beta \ln P + u \]

\(\beta\) here represents elasticity, percentage change in demand with respect to percentage change in price

45. Linearity in parameters

One of the assumptions is to be linear in parameters, all of the following are linear in parameters even if they are not linear in independent variable

\[ y = \alpha + \beta x\\ y = \alpha + \beta x^2\\ \ln y = \alpha + \beta \ln x \]

A model that will violate this assumption is

\[ y =( \alpha+x)^2 \]

cuz when we expand it, the parameter \(\alpha\) enters nonlinearly (as \(\alpha^2\)):

\[ y = \alpha^2 + 2\alpha x + x^2 \]

46. Random sample summary

Mathematical definition:

If we have many random variables \(\{y_1,\dots,y_n\}\) that are independent and have a common pdf, then this is a random sample

Example:

\[ \text{wages} = \alpha + \beta \text{educ}+u \]

we choose individuals who have wage and education data; choosing a person with high education doesn't change the probability of picking any other individual

Notice:

If we have two populations where population A is represented as

\[ \text{wages} = \alpha + \beta \text{educ}+u \]

and Population B is represented as

\[ \text{wages} = \alpha + \beta_1 \text{educ}+ \beta_2 \text{ability}+u \]

I can’t take a random sample by selecting individuals from both populations

Another violation will be if we focus on one area instead of the whole population

47. Explanation of random sampling and serial correlation

If we have random sample then we have no serial correlation,

so why state the two assumptions and not just one?

cuz time series data are not a random sample, so we will be dealing with non-random samples and serial correlation later on.

Proof of random sampling means no serial correlation:

We have a population with a function

\[ y = \alpha + \beta x + u \]

we get N individuals, who are \(iid\), we write the independence mathematically as

\[ E[y_iy_j] = E[y_i]E[y_j] \qquad \forall \, i \neq j \]

Using independence, we deduce that

\[ cov(y_i,y_j) = E[y_iy_j]-E[y_i]E[y_j] = 0 \]

If they are independent, then covariance is zero

we can expand the covariance, but take into account that \(x,u\) are uncorrelated by zero conditional mean assumption \(E[u_i|x_i]=0\)

lets expand

\[ \begin{align*} cov(\alpha +\beta x_i+u_i,\; \alpha + \beta x_j + u_j ) &= \beta^2cov(x_i,x_j)+ cov(u_i,u_j)\\ &= 0 \end{align*} \]

48. Serial correlation summary

serial correlation means

\[ cov(u_i,u_s)\neq 0 \quad i\neq s \]

aka there is a relationship between the error terms

If we have serial correlation, OLS is no longer best, There are other linear unbiased estimators with smaller sampling variance

To get serial correlation:

  1. omit an important variable
    1. this is a systematic error in the model, not in the population
  2. functional misspecification
    1. still not true serial correlation; we used the wrong model, so we have a systematic error, which also makes the model biased
  3. measurement error in the independent variable
    1. will also make the model biased

Note: GLS and FGLS can give BLUE estimators, e.g. Cochrane-Orcutt, Prais-Winsten

49. Serial correlation as a symptom of omitted variable bias

If we model farmers income over time, There is a pattern of ups and downs, our model that we fit is a straight line, so we get runs of ups and downs.

Why?

cuz we did not include an important variable: amount of rain which also has ups and downs that explain the ups and downs in the income

Another example:

sales as a function of advertising and seasonality

our residuals will still show runs of ups and downs that the included variables can't fully explain

why?

cuz we dropped an important factor: price

50. Serial correlation as a symptom of functional misspecification

wage as a function of age,

wages start low, grow with age, then decline in old age.

so there is an inverted \(U\) shape, a straight line is not representative and we get runs of ups and downs

Remember: runs of ups and downs in the residuals are the signature of serial correlation

To solve the problem, just add \(\text{age}^2\)

51. Serial correlation caused by measurement error

Farmer income with respect to time and rain,

we know the amount of rain from a weather station, but it is not accurate: it can't capture high levels of rain (the measure is censored)

Our model will underpredict income in heavy-rain periods, producing runs of errors, aka serial correlation.

52. Serial correlation biased standard error part 1

Serial correlation can be caused by clustering

Example

\[ TS_{ig} = \alpha + \beta CS_g + \epsilon_{ig} \]

How are test scores affected by classroom size? We know that \(\beta<0\) (as size increases, teacher focus per student decreases), and we know that there is an error term

The error term consists of two errors, difference in classes <teachers, books etc.> and difference in students

\[ \epsilon_{ig} = v_g + \eta_i \]

\(v_g\) is shared by all students in a class with one teacher, aka clustering, so two individuals \(i,j\) in the same class will be correlated

\[ cov(\epsilon_{i},\epsilon_j)= \sigma^2_v \]

this common variance comes from the shared teacher

53. Serial correlation biased standard error part 2

Back to our test score problem, we said the error terms consists of two errors based on students and teachers variability

\[ TS_{ig} = \alpha + \beta CS_g + \epsilon_{ig} \]

If we model our data, we get \(\beta =-10\) and we get standard error \(SE=1.5\) which we will use later on for inference

But the problem is: programs assume GM assumptions are met, but we have serial correlation, so the true standard error may be \(SE = 5\). This big difference will make all the inference wrong

54. Heteroskedasticity summary

by homoscedasticity, we assume variance of the error term is constant

\[ var(u_i|X_i) = \sigma^2 \]

Visually, we can fit the data between two parallel lines

If this assumption is not met, we have heteroskedasticity, aka the variance of the error term increases or decreases as \(x\) changes

\[ var(u_i|X_i) = f(x_i) \]

why bother?

if its violated, estimators are no longer best

Why?

cuz model is lacking information that explains the change in variance

55. Heteroskedastic errors example 1

How wage depends on education?

\[ wage = \alpha + \beta \, educ + u \]

we expect a linear relationship and \(\beta>0\).

But, as education increases, the options of jobs increase, he can be a rich banker or poor academic

Fitting a straight line will show that errors variance increase with education

\(\beta\) is unbiased cuz it can predict the average case, our problem is with the variance

\[ Var(u_i|educ_i) = \sigma^2 educ \]

However

\[ E[u_i|educ_i]=0 \]

Other estimators are WLS, GLS

56. Heteroskedastic errors example 2

we see percentage spending on food with respect to income

\[ PSF = \alpha + \beta \, inc + u \]

As an individual gets richer, the spread becomes wider: he can be a foodie or indifferent to expensive food.

However, if he is poor, he will spend almost all his money on food. The richer you are, the more choices you have

57. Heteroskedasticity caused by data aggregation

aggregation means grouping

We see how test score is affected by parental income

\[ TS = \alpha + \beta \, pinc + u_i \]

if parents are rich, student has more computers and books, we expect positive relationship \(\beta>0\).

We plot the data for \(10,000\) individuals. But there is a lot of variability in the data, so we average individuals into 500 groups to decrease the variability

\[ \bar{TS} = \alpha + \beta \, \bar{pinc} + \bar u_g \]

No free lunch in econometrics

\(\hat\beta\) becomes inefficient

Heteroskedasticity occurs cuz error is defined as arithmetic mean of each group. For example, groups \(g,f\)

\[ \bar u_g = \dfrac {1} {n_g} (u_{1g} + u_{2g}+\dots + u_{ng}) \]

\[ \bar u_f = \dfrac {1} {n_f} (u_{1f} + u_{2f}+\dots + u_{nf}) \]

so the error depends on the group, and each group has its own variance \(var(\bar u_g) = \sigma^2/n_g\), which differs when group sizes differ

\[ var(\bar u_g)\neq var(\bar u_f) \]
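A quick check of the group-mean variance claim (group sizes 5 and 20 are arbitrary choices): the variance of \(\bar u_g\) is roughly \(\sigma^2/n_g\), so groups of different sizes have different error variances.

```python
import random

random.seed(10)

def var_of_group_mean(n_g, reps=50_000, sigma=1.0):
    total = 0.0
    for _ in range(reps):
        m = sum(random.gauss(0, sigma) for _ in range(n_g)) / n_g
        total += m * m            # mean of u-bar is 0, so E[m^2] is its variance
    return total / reps

v_small = var_of_group_mean(5)    # roughly sigma^2 / 5  = 0.20
v_big   = var_of_group_mean(20)   # roughly sigma^2 / 20 = 0.05
```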

58. Perfect collinearity example 1

Finally, the last assumption. no perfect collinearity.

We can’t add square foot and square meter in the same model cuz they measure the same thing

\[ SQF = g*SQM \]

\(\beta\) will not have a value, any program will return an error message

why?

remember that regression is just partial differentiation, the partial effect of a variable holding other factors fixed. If two variables measure the same thing, we can't vary one while holding the other fixed, so the partial effect is undefined

Another example:

consumption with respect to non labor income, salary, total income

\[ C_i = \alpha + \beta_1 \, NL + \beta_2 \, salary + \beta_3 \, TI + u \]

Notice that non labor income + salary = total income aka exact relationship

\[ TI = NL + salary \]
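We can see the failure directly: with perfectly collinear regressors, the (demeaned) Gram matrix \(X'X\) that OLS must invert is singular. A pure-Python sketch (the conversion factor 10.7639 sq ft per sq m is the real one; the data are simulated):

```python
import random

random.seed(4)
sqm = [random.uniform(50, 250) for _ in range(1000)]
sqf = [10.7639 * m for m in sqm]          # square feet = 10.7639 * square meters

def demean(v):
    mean = sum(v) / len(v)
    return [x - mean for x in v]

# Gram matrix of the two demeaned regressors; OLS needs to invert this
x1, x2 = demean(sqm), demean(sqf)
s11 = sum(a * a for a in x1)
s22 = sum(a * a for a in x2)
s12 = sum(a * b for a, b in zip(x1, x2))

det = s11 * s22 - s12 * s12   # zero: the matrix cannot be inverted
```

This zero determinant is exactly why a program "returns an error message": there is no unique solution for the betas.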

59. Perfect collinearity example 2

we see test score with respect to size of home and distance from country center

\[ TS = \alpha + \beta_1 \, size + \beta_2 \, country + u_i \]

both are dummy variables. size takes 1 if the home has more than 2 bedrooms, 0 otherwise,

country takes 1 if close to the center, zero otherwise

If we select rich people only, country and size are always 1, so the dummy variables cause perfect collinearity

60. Multicollinearity

multicollinearity means that \(x_1,x_2\) have a relationship but not a perfect one.

Example:

sales as a function of Tv advertising and radio advertising

\[ Sales = \alpha + \beta_1 \, TV + \beta_2 \, Radio + u \]

We used TV and radio at the same time, but spent more on TV

If we check the relation between TV and Radio, we find a positive relationship between them. Regression will not be able to distinguish effect of each variable

Result of multicollinearity:

\(R^2\) is big, like 0.9, but individually each variable is not significant; the betas have huge standard errors

61. Index- where we currently are

We have a sample, and we use OLS to estimate the population parameters. OLS estimators are BLUE when the Gauss-Markov assumptions are met

We do diagnostic tests to see if the assumptions are met or not

If one or more of the assumptions are not met, we use other methods like GLS, ML, IV

Next: proof of Gauss Markov assumptions

62. Gauss-Markov proof part 1

We will proof that the assumptions if met, make the estimator BLUE

The equation is

\[ y =\alpha + \beta x + u \]

And we know that \(\hat \beta\) is

\[ \hat \beta = \dfrac{\sum(x_i-\bar x)y_i}{\sum(x_i-\bar x)^2} = \sum v_iy_i \]

For simplification, we write the fraction as a weighted sum of the \(y_i\) with weights \(v_i\)

Notice that this is where linearity comes in: \(\hat \beta\) is a linear combination of the \(y_i\)

\[ \boxed{v_i =\dfrac{x_i-\bar x}{S_{xx}}} \qquad S_{xx} \equiv \sum(x_i - \bar x)^2 \]

And it has the following properties

\[ \boxed{\sum v_i = 0} \]

cuz of the numerator, when we sum it we get

\[ \sum(x_i - \bar x) = n \bar x - n \bar x = 0 \]

and if we sum the squared \(v_i\), we get

\[ \boxed{\sum v_i^2 = \dfrac{\sum(x_i - \bar x)^2}{S_{xx}^2} = \dfrac{1}{S_{xx}}>0} \]
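Both properties of the weights, \(\sum v_i = 0\) and \(\sum v_i^2 = 1/S_{xx}\), are easy to verify numerically (the data here are arbitrary draws):

```python
import random

random.seed(5)
x = [random.gauss(0, 2) for _ in range(500)]

xbar = sum(x) / len(x)
Sxx  = sum((xi - xbar) ** 2 for xi in x)
v    = [(xi - xbar) / Sxx for xi in x]   # the OLS weights v_i

sum_v  = sum(v)                      # property 1: sums to zero
sum_v2 = sum(vi * vi for vi in v)    # property 2: equals 1 / Sxx
```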

63. Gauss-Markov proof part 2

  1. we will prove first that \(\hat \beta\) is unbiased using zero conditional mean assumption.
  2. Then we get variance and assume no serial correlation, homoscedasticity
  3. Finally, we make a new beta \(\tilde \beta\) and prove that its unbiased and get its variance, and prove that \(\hat \beta\) is efficient by showing \(var(\tilde \beta) \ge var(\hat \beta)\)

Lets do the unbiased thing

\[ \begin{align*} \hat \beta &= \sum v_iy_i\\ &= \sum v_i(\alpha + \beta x_i + u_i)\\ &= \alpha \sum v_i + \beta \sum v_i x_i + \sum v_iu_i\\ &= 0 + \beta\,\dfrac{\sum(x_i-\bar x)x_i}{S_{xx}} + \sum v_iu_i\\ &= 0 +\beta\, \dfrac{\sum (x_i - \bar x)^2}{\sum (x_i - \bar x)^2} + \sum v_i u_i\\ &= 0 + 1 \cdot \beta + \sum v_i u_i\\ &= \beta + \sum v_i u_i \end{align*} \]

So for \(\hat \beta\) to be unbiased, we need \(\sum v_i u_i\) to disappear in expectation. To do so, we use the zero conditional mean assumption: conditioning on the \(x\)'s, the weights \(v_i\) are fixed and come out of the expectation

\[ \begin{align*} E[\hat \beta] &= \beta + \sum E[v_i u_i]\\ &= \beta + \sum v_i E[u_i]\\ &= \beta + 0 \end{align*} \]

Hence we use zero conditional mean assumption

\[ E(u|x_i) = 0 \]

64. Gauss-Markov proof part 3

We proved that \(\hat \beta\) is unbiased using zero conditional mean; now we derive the variance assuming no serial correlation and homoscedasticity

We know the formula for \(\hat \beta\)

\[ \hat \beta_{LS} = \beta + \sum v_iu_i \]

Then we take variance

\[ var(\hat \beta_{LS}) = var(\beta + \sum v_iu_i) \]

Since \(\beta\) is a fixed number with no variance, it disappears, e.g. the variance of 10 is 0

and we have no serial correlation

\[ var(\hat \beta_{LS}) = var(\sum v_iu_i) = \sum var(v_iu_i) \]

we have no covariance terms due to the assumption of no serial correlation.

Finally, we assume homoscedasticity, so the variance of \(u_i\) is a constant \(\sigma^2\), and each \(v_i\) is a fixed number, so it comes out of the variance squared

\[ \boxed{var(\hat \beta_{LS}) = \sum v_i^2 \sigma^2 = \sigma^2 \sum v_i^2 = \dfrac{\sigma^2}{S_{xx}}} \]

Next time, we do \(\tilde \beta\) that is unbiased and compare the variance
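The formula \(var(\hat\beta_{LS}) = \sigma^2/S_{xx}\) can be checked by Monte Carlo, holding \(x\) fixed and redrawing the errors (all parameter values illustrative):

```python
import random, statistics

random.seed(6)
n, sigma, beta = 50, 2.0, 3.0
x = [random.uniform(0, 10) for _ in range(n)]   # regressors held fixed
xbar = sum(x) / n
Sxx  = sum((xi - xbar) ** 2 for xi in x)

def beta_hat():
    u = [random.gauss(0, sigma) for _ in range(n)]
    y = [1.0 + beta * xi + ui for xi, ui in zip(x, u)]
    ybar = sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx

draws  = [beta_hat() for _ in range(20_000)]
mc_var = statistics.pvariance(draws)   # Monte Carlo variance of beta_hat
theory = sigma ** 2 / Sxx              # the formula above
```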

65. Gauss-Markov proof part 4

Lets do \(\tilde \beta\) to prove that \(\hat \beta\) is more efficient,

let \(b_i\) be some kind of weighting

\[ \begin{align*} \tilde \beta &= \sum b_iy_i\\ &= \sum b_i (\alpha + \beta x_i + u_i)\\ &= \alpha \sum b_i + \beta \sum b_i x_i + \sum b_i u_i \end{align*} \]

Then we take its expectation, last term disappears due to zero conditional mean assumption

\[ \begin{align*} E[\tilde \beta] = \alpha \sum b_i + \beta \sum b_i x_i + 0 \end{align*} \]

To have the expectation \(E [\tilde \beta] = \beta\) we must have these conditions

\[ \sum b_i = 0, \sum b_i x_i = 1 \]

To get

\[ \begin{align*} E[\tilde \beta] = \alpha \cdot 0 + \beta \cdot 1 + 0 = \beta \end{align*} \]

66. Gauss-Markov proof part 5

If we apply the constraints used for \(\tilde \beta\) to be unbiased, we get that

\[ \tilde \beta = \beta + \sum b_iu_i \]

Lets assume that \(b_i\) weights is same as least squares weights \(v_i\) + some difference \(c_i\)

\[ b_i = v_i + c_i \]

Then we substitute

\[ \tilde \beta = \beta + \sum(v_i + c_i)u_i \]

then take the variance

\[ var(\tilde \beta) = \sigma^2 \sum(v_i+c_i)^2 \]

Lets expand the terms

\[ var(\tilde \beta) = \sigma^2(\sum v_i^2 + 2 \sum v_ic_i+ \sum c_i^2) \]

using the \(b_i\) constraints that made it unbiased, we get that

\[ \sum(v_i +c_i) = 0 \to \sum c_i=0 \]

and

\[ \sum(v_i +c_i)x_i = \sum v_i x_i + \sum c_ix_i = 1 \to \sum c_ix_i = 0 \]

The above expression has to sum to one for \(\tilde \beta\) to be unbiased and we know from ols that

\[ \sum v_ix_i = 1 \]

67. Gauss-Markov proof part 6

We have variance of \(\tilde \beta\) and conditions, if we expand \(\sum v_ic_i\) we get zero from the two conditions <\(\sum c_i = 0, \sum c_ix_i=0\)>

\[ \sum \dfrac{x_i - \bar x}{S_{xx}}c_i = \dfrac{1}{S_{xx}} \sum x_ic_i - \dfrac{\bar x}{S_{xx}}\sum c_i= 0 \]

Hence the variance is

\[ var(\tilde \beta) = \sigma^2 \sum v_i^2 + \sigma^2 \sum c_i^2 \]

while variance of ols is

\[ var(\hat \beta_{LS}) = \sigma^2 \sum v_i^2 \]

and knowing that \(\sum c_i^2\) is non-negative, we reach

\[ var (\tilde \beta)\ge var(\hat \beta) \]

so \(\hat \beta\) is the best linear unbiased estimator

68. Errors in population vs estimated errors

There is a process that creates the variables in the population. For example, in the population we have

\[ wages = \alpha + \beta_1\, educ + \beta_2 \, ability + u \]

Of course the process is a linear combination here. We expect the error term to be \(iid(0, \sigma^2)\)

Taking expectation, the error term disappears. Cuz people with same education and ability can be university lecturers or investment bankers aka below average or above it.

\[ E[wages|educ, abil] = \alpha + \beta_1 \, educ + \beta_2 \, abil \]

In econometrics, we have a sample not a population, so when we fit OLS, we estimate the error term \(\hat u\) and call it residuals.

\[ wages = \hat\alpha + \hat\beta_1\, educ +\hat u \]

The residuals are not the population errors, cuz the estimated model lacks terms included in the population process

\[ \hat u_i \neq u_i \]

69. Sum of squares

So what is TSS - ESS - RSS?

We have a scatterplot of \(y_i\) and \(x_i\). We can draw a horizontal line at the mean \(\bar y\). The variation is the vertical distance of points from this line, squared so points above the line don't cancel points below it. This is called the total sum of squares, TSS

\[ TSS = \sum(y_i - \bar y)^2 \]

Then we fit our Least squares line, and we get \(\hat y\) then see the vertical distance between \(\hat y\) and \(\bar y\). This is explained sum of squares ESS. And it means how much of the variation in \(y\) that we can explain

\[ ESS = \sum(\hat y - \bar y)^2 \]

Our model can’t fit through every point perfectly, so \(ESS \le TSS\), the ratio is \(\le 1\) cuz of the remaining part: difference between \(\hat y\) and \(y\) which is called residual sum of squares RSS

\[ RSS = \sum(y -\hat y)^2 \]

Intuitively:

\[ TSS = ESS+RSS \]

Remember: the distance between \(y , \bar y\) is divided into distance between

  1. \(\bar y, \hat y\)
  2. \(\hat y, y\)

so \(\hat y\) usually lies between \(y\) and \(\bar y\) (related to the regression towards the mean phenomenon), and \(\hat y = \bar y\) at the point \(\bar x\)
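The decomposition \(TSS = ESS + RSS\) holds exactly for OLS with an intercept; a pure-Python check (simulated data, arbitrary coefficients):

```python
import random

random.seed(7)
n = 200
x = [random.uniform(0, 5) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0, 1) for xi in x]

# OLS fit
xbar, ybar = sum(x) / n, sum(y) / n
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / sum((xi - xbar) ** 2 for xi in x))
a = ybar - b * xbar
yhat = [a + b * xi for xi in x]

TSS = sum((yi - ybar) ** 2 for yi in y)
ESS = sum((yh - ybar) ** 2 for yh in yhat)
RSS = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
```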

70. R squared part 1

Back to our scatterplot, we try to explain the deviation around \(\bar y\) by fitting a line, the distance we explain is the distance between \(\bar y, \hat y\) aka called ESS.

The problem with ESS is that it has squared units: if we measure \(y\) in meters, ESS will be, say, 10 square meters, and we have no idea whether that is big or not.

Solution:

We do a ratio of the explained from total called R squared

\[ R^2 = \dfrac{ESS}{TSS} \qquad 0\le R^2\le 1 \]

Extreme values for \(R^2\):

  1. \(0\) if we estimate \(\hat y = \bar y\), aka \(ESS = 0\)
  2. \(1\) if the line passes through every point, \(\hat y = y\), aka \(ESS = TSS\)

Note: we don’t depend on R squared cuz it has serious problems

71. R squared part 2

Why don’t we count on R squared?

If we have the model

\[ y = \alpha + \beta_1 \, x_1 + u \]

And calculate \(R^2 = 0.65\) which is interpreted as:

65% of the variation in \(y\) is explained by the model

Then we do a new model with extra variables

\[ y = \alpha + \beta_1 \, x_1 + \beta_2 \, x_2+u \]

\(R^2\) cannot decrease; it almost always increases.

If we add more variables, \(R^2\) will approach 1, even if the variables are rubbish.

\[ y = \alpha + \beta_1 \, x_1 + \beta_2 \, x_2 \,+ \dots + \beta_p \, x_p +u \]

R squared can't be used to compare models, cuz adding more variables makes the line more wiggly until it passes through all the points.

72. Degrees of freedom part 1

Scatter plot again, we know the equation of a

\[ y = mx+c \]

if I can choose any numbers for \(m,c\), I can draw infinite lines. cuz I have 2 degrees of freedom, I am free to vary two variables \(m,c\) as I like

If I limit it to

\[ y = mx+2 \]

I can draw infinitely many lines that must pass through the intercept \(= 2\); I have 1 degree of freedom (I am free to vary \(m\) only)

If the constraint is on \(m\)

\[ y = 5x+c \]

I can draw infinitely many lines parallel to each other with slope = 5, aka 1 degree of freedom

Finally, If I have constraints on both of them

\[ y = 5x+2 \]

I have zero degrees of freedom, only one line satisfies this equation

73. Degrees of freedom part 2

If we have three observations of \(x\), and we know the mean \(\bar x = 2\)

We have infinitely many ways to get \(\bar x = 2\) using \(x_1,x_2,x_3\). You may think you have three degrees of freedom, but you don't

If we choose \(x_1 =1\) and \(x_2 = 1\), \(x_3\) has to be equal to 4 so we get \(\bar x = 2\). Meaning that we have 2 degrees of freedom

In general, This situation has \(N-1\) degrees of freedom

Why bother?

If we have a sample from population and try to estimate the variance, we may be tempted to say

\[ S^2_N = \dfrac 1 N \sum (x_i- \bar x)^2 \]

but this is a biased estimator, why?

If we assume \(\bar x = 0\), we can rewrite sample variance as

\[ S^2_N = \dfrac 1N (x_1^2 + x^2_2 + \dots + x_N^2) \]

each of the \(x\)'s contributes an expected \(\sigma^2\), except the last term, which is pinned down by the constraint \(\bar x = 0\) (the last term is not free to vary, so it contributes no variance); we have \(N-1\) degrees of freedom

\[ E[S_N^2] = \dfrac{N-1}{N} \sigma^2 \]

The sample variance underestimates the population variance; we fix it with Bessel's correction (dividing by \(N-1\) instead of \(N\))
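A simulation of the \((N-1)/N\) shrinkage (with \(N = 5\) and \(\sigma^2 = 4\), chosen for illustration): the divide-by-\(N\) estimator averages about \(\frac{4}{5}\cdot 4 = 3.2\), not 4.

```python
import random

random.seed(8)
N, reps = 5, 100_000

def s2_biased(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)   # divides by N, not N-1

avg = sum(s2_biased([random.gauss(0, 2.0) for _ in range(N)])
          for _ in range(reps)) / reps

expected = (N - 1) / N * 4.0   # sigma^2 = 4, so expected average is 3.2
```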

74. Overfitting in econometrics

If we are trying to explain variations in house prices with respect to square feet, plot it on scatter plot, fit a straight line, then we can predict house price with square feet = 500. This model has for example \(R^2 = 0.55\)

But maybe a quadratic curve is a better fit than a straight line, with a higher \(R^2 = 0.65\)

If we add millions of variables, we fit every point perfectly,

\[ HP = \alpha + \beta_1 \, SQF + \beta_2 \, SQF^2 + \dots + \beta_p \, SQF^p + u \]

but this model is bad cuz it fits the noise, not the signal. This is called overfitting: prediction is all wrong although \(R^2 = 1\), and it will do horribly if \(x\) is outside the sample range

Hence, we need Adjusted R Squared \(\bar R^2\)

75. Adjusted R squared

To fix the issues with \(R^2\) we have adjusted R Squared, \(\bar{ R^2}\) which is

\[ \boxed{\bar{R^2} = 1 - \dfrac{(1-R^2)(N-1)}{N-K-1}} \]

What does this mess mean?

\(N:\) number of data points

\(K:\) number of independent variables

\(N > K+1\)

If \(K\) increases, \(N-K-1\) decreases, so ratio gets bigger, and since we subtract by it, the overall value decreases.

\[ K \uparrow,\quad N-K-1 \downarrow,\quad \dfrac{N-1}{N-K-1} \uparrow,\quad \bar R^2 \downarrow \]

Interpretation:

if there is no significant increase in \(R^2\) by adding variables, the \(\bar {R^2}\) falls

Example:

\[ \begin{align*} wages &= \alpha + \beta_1 educ + u\\ wages &= \alpha + \beta_1 educ + \beta_2 \, right~ handed + u \end{align*} \]

We expect \(\bar {R^2}\) to fall in the second model and choose the first model
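The formula is mechanical enough to code directly (the \(R^2\) values 0.650 and 0.651 and \(N = 100\) are made-up numbers for the wage example): a junk regressor that barely moves \(R^2\) lowers \(\bar R^2\).

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n data points and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# the right-handed dummy nudges R^2 from 0.650 to 0.651,
# but adjusted R^2 falls, so we prefer the first model
model_1 = adjusted_r2(0.650, 100, 1)   # wages on educ
model_2 = adjusted_r2(0.651, 100, 2)   # add the right-handed dummy
```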

Extra part:

why \((N-1)\) and \((N-K-1)\)?

when we calculate \(TSS\) we estimate \(\bar y\), so we have \(N-1\) degrees of freedom

and when we fit the model, we estimate \(K+1\) parameters [K independent variables + 1 intercept \(\alpha\)], so \(RSS\) has the remaining degrees of freedom from the data points: \(N -(K+1)= N-K-1\)

76. Unbiasedness of OLS part 1

Note:

this part and the next are a different way to prove unbiasedness that we did in \(62-67\)

If we take multiple samples from the population and estimate \(\beta^p\) in each, we get a sampling distribution of \(\hat \beta\) that is centered on \(\beta^p\), aka unbiased, or mathematically

\[ E[\hat \beta] = \beta^p \]

Using OLS, we know that

\[ \hat \beta = \dfrac{\sum (x_i- \bar x)y_i}{\sum(x_i - \bar x)^2} \]

And we can substitute value of \(y_i\)

\[ \hat \beta = \dfrac{\sum (x_i- \bar x)(\alpha^P+\beta^px_i + u_i)}{\sum(x_i - \bar x)^2} \]

Then distribute the numerator term

\[ \hat{\beta} = \dfrac{\alpha^P \sum (x_i - \bar{x}) + \beta^P \sum (x_i - \bar{x}) x_i + \sum (x_i - \bar{x}) u_i}{\sum (x_i - \bar{x})^2} \]

Next, we check each term in the numerator

77. Unbiasedness of OLS part 2

We reached this form

\[ \hat{\beta} = \dfrac{\alpha^P \sum (x_i - \bar{x}) + \beta^P \sum (x_i - \bar{x}) x_i + \sum (x_i - \bar{x}) u_i}{\sum (x_i - \bar{x})^2} \]

We can further distribute the first term into

\[ \sum(x_i - \bar x)\alpha^p = \sum x_i \alpha^p - \bar x \alpha^p \sum 1 \]

and knowing that \(\bar x = \dfrac{\sum x_i}{N}\)

\[ \sum(x_i - \bar x)\alpha^p = N \bar x \alpha^p - N \bar x \alpha^p = 0 \]

For the second term, we know this magic trick (replacing \(x_i\) with \(x_i - \bar x\) changes nothing because \(\sum(x_i - \bar x)\bar x = \bar x \sum (x_i - \bar x) = 0\))

\[ \dfrac{\sum(x_i - \bar x)x_i}{S_{xx}} = \dfrac{\sum(x_i - \bar x)(x_i - \bar x)}{S_{xx}} = 1 \]

so second term = \(\beta^p\)

So we get that

\[ \hat \beta = \beta^p + \dfrac{\sum(x_i - \bar x)u_i}{S_{xx}} \]

Taking the expectation:

\[ E[\hat \beta] = \beta^p + \dfrac{\sum E[(x_i - \bar x)u_i]}{S_{xx}} \]

And we have the assumption \(E[u_i|x_i] = 0\), which lets us condition on \(x\) and pull the \((x_i - \bar x)\) term out of the expectation

\[ E[\hat \beta] = \beta^p + \dfrac{\sum (x_i - \bar x)E(u_i)}{S_{xx}} \]

Knowing that \(E(u_i) = 0\), the whole mess disappears and \(E[\hat \beta] = \beta^p\)
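The proof can be verified with a small Monte Carlo (a sketch assuming numpy; the population values \(\alpha^p = 1, \beta^p = 0.5\) are made up): draw many samples where \(E[u_i|x_i]=0\) holds by construction and check that the \(\hat\beta\) estimates average out to \(\beta^p\).

```python
import numpy as np

rng = np.random.default_rng(1)
alpha_p, beta_p = 1.0, 0.5          # made-up population parameters
N, REPS = 50, 5000

estimates = np.empty(REPS)
for r in range(REPS):
    x = rng.uniform(0, 10, N)
    u = rng.normal(size=N)          # E[u|x] = 0 holds by construction
    y = alpha_p + beta_p * x + u
    xd = x - x.mean()
    estimates[r] = (xd @ y) / (xd @ xd)   # OLS slope: sum (x - xbar) y / S_xx

# the sampling distribution of beta_hat centres on beta_p
print(estimates.mean())
```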

78. Variance of OLS <SE> Estimators part 1

If we are interested in test scores explained by class size, statistical software may produce the equation

\[ TS = 50 - 10CS \]

where class room size is a dummy variable

\[ CS = \begin{cases}1 &, \text{class size} \ge 30\\ 0 &, \text{otherwise} \end{cases} \]

Are these values due to sampling error, or are they really true?

If the true population value is 0 and the sampling distribution is wide, it's easy to get \(-10\) from sampling error.

But if the estimator is efficient (has little sampling variance), it can't get \(-10\) from sampling error, cuz the sampling distribution is narrow

So knowing the variance of the estimator is crucial to decide if its value is from sampling error or not

Note: we actually mean standard error which is the standard deviation of the sampling distribution

To calculate the standard error start with \(\hat \beta\)

\[ \hat \beta = \beta ^p + \dfrac{\sum(x_i - \bar x)u_i}{S_{xx}} \]

take the variance, knowing \(\beta ^p\) is a constant so it has variance 0 and covariance 0

\[ var(\hat \beta | x_i) = var\left(\dfrac{\sum(x_i - \bar x)u_i}{S_{xx}}\right) \]

and assuming no serial correlation, there is no covariance terms

\[ \boxed{var(\hat \beta | x_i) = \dfrac{\sum var[(x_i - \bar x)u_i]}{S^2_{xx}}} \]

79. Variance of OLS estimators part 2

Going further, we assume homoscedasticity, \(var(u_i|x_i)=\sigma^2\)

\[ var(\hat \beta| x_i) = \dfrac{\sum (x_i - \bar x)^2 var(u_i)}{S^2_{xx}} \]

which is equal to

\[ \sigma^2 \dfrac{\sum (x_i - \bar x)^2}{(\sum (x_i - \bar x)^2)^2} \]

We can cancel the numerator against one factor of the denominator to get:

\[ \boxed{var(\hat \beta | x_i) = \dfrac{\sigma^2}{\sum (x_i - \bar x)^2}} \]

Of course, we don’t know \(\sigma^2\) so we estimate it
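The boxed formula can be checked by simulation (a sketch assuming numpy; the parameter values are made up): fix \(x\) once, since the variance is conditional on \(x\), redraw the errors many times, and compare the empirical variance of \(\hat\beta\) to \(\sigma^2/\sum(x_i-\bar x)^2\).

```python
import numpy as np

rng = np.random.default_rng(2)
N, REPS, sigma2 = 100, 20000, 4.0

x = rng.uniform(0, 10, N)        # fix the regressors: the variance is conditional on x
xd = x - x.mean()
sxx = xd @ xd

est = np.empty(REPS)
for r in range(REPS):
    u = rng.normal(scale=np.sqrt(sigma2), size=N)
    y = 1.0 + 0.5 * x + u        # made-up population line
    est[r] = (xd @ y) / sxx      # OLS slope

theory = sigma2 / sxx            # the boxed formula
empirical = est.var()            # variance of beta_hat across samples
print(theory, empirical)
```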

80. Estimator for the population error variance

If we regress \(y\) on \(x\) with error \(u \sim N(0, \sigma^2)\)

then variance is

\[ Var(u) = E[(u-E(u))^2] = E[u^2]= \sigma^2 \]

Then we get point estimates \(\hat\alpha, \hat\beta\) and the residuals \(\hat u\)

To estimate the population variance

\[ \tilde \sigma ^2 = \dfrac 1 N \sum \hat u_i^2 = \dfrac 1 N [\hat u_1^2 + \hat u_2^2 + \dots + \hat u_{N-1}^2 + \hat u_N^2] \]

and the last two terms contribute nothing new, cuz we have \(N-2\) degrees of freedom due to the two constraints we used to estimate \(\alpha, \beta\), which pin down two of the residuals:

\[ \sum \hat u_i = 0, \quad \sum x_i\hat u_i = 0 \]

So the expectation is

\[ \boxed{E[\tilde \sigma^2] = \dfrac{N-2}{N} \sigma^2} \]

The correct estimator divides by \(N-2\) instead

\[ \boxed{\hat\sigma^2 = \dfrac{1}{N-2}\sum \hat u_i^2, \qquad E[\hat\sigma^2] = \sigma^2} \]

With \(K\) regressors plus an intercept, the divisor becomes \(N-K-1\)
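A quick simulation (a sketch assuming numpy; parameters are made up) shows the bias: dividing the sum of squared residuals by \(N\) undershoots \(\sigma^2\) by the factor \((N-2)/N\), while dividing by \(N-2\) hits it on average.

```python
import numpy as np

rng = np.random.default_rng(3)
N, REPS, sigma2 = 30, 20000, 9.0

x = rng.uniform(0, 10, N)
X = np.column_stack([np.ones(N), x])

naive = np.empty(REPS)       # divide by N      -> biased downward
corrected = np.empty(REPS)   # divide by N - 2  -> unbiased
for r in range(REPS):
    y = 2.0 + 1.0 * x + rng.normal(scale=np.sqrt(sigma2), size=N)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    naive[r] = (resid @ resid) / N
    corrected[r] = (resid @ resid) / (N - 2)

# naive averages about (N-2)/N * sigma2; corrected averages about sigma2
print(naive.mean(), corrected.mean())
```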

81. Estimated variance of OLS estimators - intuition

We reached that

\[ \boxed{Var(\hat \beta|x_i) = \dfrac{\dfrac{1}{N-2}\sum \hat u_i^2}{\sum(x_i - \bar x)^2}} \]

This is the formula used by statistical packages

The denominator is the total variation of \(x\): as \(x\) varies more, the denominator gets bigger and the whole value falls

\[ var(x)\uparrow, denom \uparrow, frac \downarrow \]

Why? for the denominator

Imagine a scatterplot where the x are clustered in a tiny area, you are not confident that your line is actually the best. Perhaps you can fit many lines in this small cluster

You get more confident as you have more spread in the \(x\) values

The numerator measures how badly the model fits the data: bigger residuals mean a bigger estimated variance
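The denominator intuition can be illustrated directly (a sketch assuming numpy and made-up data): the same errors paired with tightly clustered \(x\) values give a far larger standard error than well spread out \(x\) values.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100
u = rng.normal(size=N)               # same errors for both designs

def se_beta(x, y):
    """Standard error of the OLS slope from the boxed formula."""
    xd = x - x.mean()
    beta = (xd @ y) / (xd @ xd)
    alpha = y.mean() - beta * x.mean()
    resid = y - alpha - beta * x
    sigma2_hat = (resid @ resid) / (len(y) - 2)
    return np.sqrt(sigma2_hat / (xd @ xd))

x_tight = rng.uniform(4.9, 5.1, N)   # x clustered in a tiny range
x_wide = rng.uniform(0, 10, N)       # x well spread out
se_tight = se_beta(x_tight, 1 + 0.5 * x_tight + u)
se_wide = se_beta(x_wide, 1 + 0.5 * x_wide + u)
print(se_tight, se_wide)             # the tight cluster gives a much larger SE
```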

82. Variance of OLS estimators in presence of heteroscedasticity

In the normal case

\[ \begin{align*} var(\hat \beta | x_i) &= \dfrac{\sum var[(x_i - \bar x)u_i]}{S_{xx}^2}\\ &= \dfrac{\sigma^2 \sum (x_i - \bar x)^2}{S_{xx}^2} \\ &= \dfrac{\sigma^2}{\sum (x_i - \bar x)^2}\\ &= \dfrac{\sigma^2}{S_{xx}} \end{align*} \]

But what if there is heteroscedasticity?

\[ Var(u_i|x_i) = \sigma^2_i \]

So we estimate \(\sigma^2_i\) as \(\hat u_i^2\) <notice the \(i\)>

\[ \boxed{var(\hat \beta| x_i) = \dfrac{\sum (x_i - \bar x)^2 \hat u_i^2}{S^2_{xx}}} \]

This is called the robust standard error or White standard error
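A sketch of both formulas side by side (assuming numpy and a made-up heteroscedastic process where the error spread grows with \(x\)):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 500
x = rng.uniform(0, 10, N)
u = rng.normal(size=N) * x           # heteroscedastic: var(u|x) grows with x
y = 1 + 0.5 * x + u

xd = x - x.mean()
sxx = xd @ xd
beta = (xd @ y) / sxx
alpha = y.mean() - beta * x.mean()
resid = y - alpha - beta * x

# classical SE assumes a single sigma^2 for every observation
classical = np.sqrt((resid @ resid) / (N - 2) / sxx)
# White / robust SE plugs each squared residual in for sigma_i^2
robust = np.sqrt((xd**2 * resid**2).sum() / sxx**2)
print(classical, robust)
```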

83. Variance of OLS estimators in presence of serial correlation

again, this is the normal case

\[ var(\hat \beta | x_i) = \dfrac{\sigma^2}{\sum (x_i - \bar x)^2} \]

If we have serial correlation

\[ cov(u_t, u_{t+j}) = \rho^j \sigma^2 \]

which is \(AR(1)\) process (more on this later)

so variance becomes <assuming \(\bar x = 0\)>

\[ var(\hat \beta|x_t) = var\left(\dfrac{\sum(x_t - \bar x)u_t}{S_{xx}}\right) \]

We can expand it to get

\[ \dfrac{\sigma^2}{S_{xx}}+ \dfrac{2 \sum_{t=1}^{T-1} \sum_{j=1}^{T-t} x_t x_{t+j} \, E[u_tu_{t+j}]}{S_{xx}^2} \]

the \(E\) part is the covariance term cuz \(E[u_t] = 0\). Finally, we replace the covariance with \(\sigma^2 \rho^j\)

\[ \dfrac{\sigma^2}{S_{xx}}+ \dfrac{2\sigma^2 \sum_{t=1}^{T-1} \sum_{j=1}^{T-t} x_t x_{t+j} \rho^j}{S_{xx}^2} \]

What does the second term mean?

if we plot the errors with respect to time, we see runs of positives and negatives, in other words \(\rho\) is greater than zero; combined with a positively autocorrelated regressor, the whole second term is \(>0\)

Summary:

if we use the normal standard errors without taking the covariance term into account, we will be underestimating the true standard error
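The underestimation can be demonstrated by simulation (a sketch assuming numpy; the AR(1) coefficient \(\rho = 0.8\) and the slowly-moving regressor are made up): the formula \(\sigma^2/S_{xx}\) comes out far below the actual sampling variance of \(\hat\beta\).

```python
import numpy as np

rng = np.random.default_rng(6)
T, REPS, rho = 100, 3000, 0.8
t = np.arange(T)
x = np.sin(t / 10)                       # slowly-moving regressor (made up)
xd = x - x.mean()
sxx = xd @ xd

est = np.empty(REPS)
for r in range(REPS):
    e = rng.normal(size=T)
    u = np.empty(T)
    u[0] = e[0] / np.sqrt(1 - rho**2)    # start the AR(1) at its stationary scale
    for s in range(1, T):
        u[s] = rho * u[s - 1] + e[s]     # serially correlated errors
    y = 1 + 0.5 * x + u
    est[r] = (xd @ y) / sxx

sigma2_u = 1 / (1 - rho**2)              # stationary variance of the AR(1) errors
naive = sigma2_u / sxx                   # formula that ignores the covariance terms
empirical = est.var()                    # actual variance of beta_hat across samples
print(naive, empirical)                  # empirical is several times larger
```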

84. Gauss Markov conditions summary of problems of violation

| Assumption violated | Bias | Inefficiency | Other |
|---|---|---|---|
| Zero conditional mean of errors | ✅ | | |
| No perfect collinearity | | | can't estimate |
| No serial correlation of errors | ✅ in case of lagged dependent variable | ✅ | can't rely on normal SE |
| Homoscedastic errors | | ✅ | can't rely on normal SE |

85. Estimating population variance from a sample part 1 - Bessel correction

If we have a population of \(x_i\) with a normal error term

\[ x_i = \mu + \epsilon_i \]

then we take a sample and estimate \(\mu \text{ by }\bar x\) which is an unbiased estimator

\[ E[\bar x] = \dfrac 1 N \sum E[x_i]= \dfrac 1 N N \mu = \mu \]

as for \(\sigma^2\), we know from the degrees of freedom part that

\[ \tilde \sigma^2 = \dfrac 1 N \sum (x_i - \bar x)^2 = \dfrac 1 N [(x_1 - \bar x)^2 + \dots + (x_N - \bar x)^2] \]

and by taking expectations

\[ E[\tilde \sigma^2] = \dfrac 1 N [\sigma^2 + \sigma^2 + \dots + 0] \]

cuz we used one degree of freedom to estimate \(\bar x\)

\[ E[\tilde \sigma^2] = \dfrac{N-1}{N} \sigma^2 \neq \sigma^2 \]

which is biased but consistent: as \(N\) goes to infinity, \(\dfrac{N-1}{N} \to 1\), so the bias vanishes

The unbiased estimator is

\[ \hat \sigma^2 = \dfrac{N}{N-1} \tilde \sigma^2 \]

cuz

\[ E[\hat \sigma^2] =\dfrac {N}{N-1} \left [ \dfrac {N-1} {N}\sigma^2 \right] = \sigma ^2 \]

This is called Bessel’s correction
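This is exactly what numpy's `ddof` argument controls: `np.var(x)` divides by \(N\) (the biased \(\tilde\sigma^2\)), while `np.var(x, ddof=1)` divides by \(N-1\). A quick Monte Carlo check (a sketch; \(\mu\) and \(\sigma^2\) are made up):

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma2 = 10.0, 4.0
N, REPS = 5, 100_000

biased = np.empty(REPS)
unbiased = np.empty(REPS)
for r in range(REPS):
    x = rng.normal(mu, np.sqrt(sigma2), N)
    biased[r] = np.var(x)            # divides by N     -> E = (N-1)/N * sigma2
    unbiased[r] = np.var(x, ddof=1)  # divides by N - 1 -> E = sigma2 (Bessel)

print(biased.mean(), unbiased.mean())
```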

86. Estimating population variance from a sample part 2

Another way to write the unbiased estimator is

\[ \hat \sigma^2 = \dfrac{1}{N-1}\sum(x_i - \bar x)^2 \]

think of the sample mean as a least squares estimator. Let the cost function \(C\) be

\[ C = \sum(x_i - \tilde x)^2 \]

To minimize the sum, we differentiate

\[ \dfrac{dC}{d\tilde x} = -2 \sum (x_i - \tilde x) = 0 \]

so to be minimal, \(\tilde x\) has to be \(\bar x\)

\[ \tilde x = \bar x = \dfrac{1}{N} \sum x_i \]

In general, \(\bar x \neq \mu\), and \(\bar x\) minimizes the squared deviations better than any other value, so it's always the case that

\[ \sum (x_i - \bar x)^2 \le \sum (x_i - \mu)^2 \]

If we calculated the \(\dfrac 1 N \sum\) formula using \(\mu\), it would be unbiased

But using \(\bar x\) instead, its expectation is less than the true population variance

Extra: Example

Take the population \(\{2,4,6\}\) and the sample \(\{2,4\}\)

\[ \begin{align*} \mu &= \dfrac{2+4+6}{3} = 4\\ \bar x &= \dfrac{2+4}{2} = 3\\ \sum(x_i - \mu)^2 &= (2-4)^2 + (4-4)^2 = 4\\ \sum(x_i - \bar x)^2 &= (2-3)^2+(4-3)^2 = 2 \end{align*} \]

The sum of squared deviations using \(\bar x\) (2) is \(\le\) the sum using \(\mu\) (4)
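The inequality \(\sum(x_i-\bar x)^2 \le \sum(x_i-\mu)^2\) holds for every sample, not just this one; a quick check (a sketch assuming numpy; \(\mu\), the spread, and the sample size are made up) confirms it:

```python
import numpy as np

rng = np.random.default_rng(8)
mu = 4.0

# xbar minimises the sum of squares, so it always beats mu
checks = []
for _ in range(1000):
    x = rng.normal(mu, 2.0, size=10)
    checks.append(((x - x.mean())**2).sum() <= ((x - mu)**2).sum())
always = all(checks)
print(always)
```

This follows from the identity \(\sum(x_i-\mu)^2 = \sum(x_i-\bar x)^2 + N(\bar x-\mu)^2\), whose second term is never negative.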

87. Problem set 2

Practical: factors affecting NBA wages

Theory: issues of heteroscedasticity, multicollinearity, endogeneity