Gauss-Markov assumptions
31. Gauss-Markov assumptions 1
If the Gauss-Markov assumptions are met, least squares is BLUE: the Best Linear Unbiased Estimator out there.
The assumptions are:
the population process is linear in parameters
linear in the parameters like \(\alpha, \beta\); the variables themselves can be nonlinear, like \(\text{educ}^2\)
x and y are a random sample
each individual is equally likely to be picked, and all come from the same population
zero conditional mean of error
\(E(u_i|x) = 0\)
If I know someone's level of education, I can't predict whether their wage is above or below the average for people with the same years of education
32. Gauss-Markov assumptions 2
The other assumptions:
No perfect collinearity in regressors
\(x_1\) can’t be a linear function of \(x_2\) like \(x_1 = \delta_0 + \delta_1x_2\)
can't use meters and kilometers in the same regression
Homoscedastic errors
width of scatter plot is constant
\[ Var(u_i|X_i) = \sigma^2 \]
No serial correlation
\[ cov(u_i,u_j) = 0 \]
knowing one error will not help me know the next error
33. Zero conditional mean of errors
What do we mean by zero conditional mean assumption?
\[ \boxed{E[u_i|x_i] =0} \]
If this assumption is not met, the estimator is biased
\[ E[\hat \beta | x_i] \neq \beta^p \]
Meaning the sampling distribution of \(\hat \beta\) is not centered around the population beta
An implication of this assumption can be written as
\[ \boxed{cov(u_i,x_i)=0} \]
aka no relationship between the independent variable and error term
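A quick simulation makes this concrete. Below is a minimal numpy sketch (all numbers and variable names are illustrative assumptions, not from the notes): when the error is built to correlate with x, the OLS slope no longer centers on the true \(\beta\).

```python
import numpy as np

# Sketch: violate E[u|x] = 0 by building the error to depend on x.
# True beta is 2.0; the slope estimate converges to 2.5 instead.
rng = np.random.default_rng(0)

def ols_slope(x, y):
    """OLS slope: cov(x, y) / var(x)."""
    xc = x - x.mean()
    return (xc * (y - y.mean())).sum() / (xc * xc).sum()

n, beta = 100_000, 2.0
x = rng.normal(size=n)
u = 0.5 * x + rng.normal(size=n)   # cov(u, x) = 0.5 != 0
y = 1.0 + beta * x + u

slope = ols_slope(x, y)
print(slope)   # about 2.5 = beta + cov(u, x) / var(x)
```

With a large sample the estimate sits near 2.5, not 2.0, no matter how much data we add: correlation between x and u is a bias, not noise.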
34. Omitted variable bias example 1
omitted variable bias violates the zero conditional mean assumption: \(E[u_i|x_i]\neq 0\)
Example: how does classroom size affect test scores? We think the relationship is negative, aka \(\beta < 0\) (a bigger class means the teacher focuses less on each student)
\[ TS = \alpha + \beta \, CS + u \]
we use OLS to estimate the \(\beta\), but it will be biased
why?
many omitted variables in \(u\) term are correlated with class size and test score, like funding
more funding \(\to\) less class size, negative relationship
more funding \(\to\) better books = better test score, positive effect
so the assumption is not met \(cov(u, CS) \neq 0\)
if \(\tilde \beta = -10\) while the real \(\beta = -5\), class size is taking credit away from funding, so the magnitude is overestimated
Why is the estimate pushed further negative?
the product of the effect (funding \(\to\) test score: \(+\)) and the correlation (funding \(\to\) class size: \(-\)) is negative, \(+\cdot - = -\), so the bias is negative
\[ \hat \beta_1 = \beta_1 + \text{bias} \qquad -10 = -5 + (-5) \]
35. Omitted variable bias example 2
\[ wages = \alpha + \beta_1 \, educ + \beta_2 \, ability + u \]
wages is a function of education and ability, but we don’t have ability data so we use education only
\[ wages = \alpha + \beta_1 \, educ + v \]
our error became bigger cuz it includes the omitted ability
It is known that people with more ability take more education, so there is a positive correlation
so zero conditional mean error condition is not satisfied, estimator is biased
\[ cov(v, educ) \neq 0 \]
\(\beta_1\) will take extra credit from \(\beta_2\), so it's overestimated, aka upwardly biased \(+\cdot + = +\)
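This story can be sketched in numpy (the coefficients and the educ-ability link below are illustrative assumptions): regressing wages on educ alone, the slope absorbs part of ability's effect.

```python
import numpy as np

# Omitted-variable sketch: true effect of educ is 1.0 and of ability 3.0,
# with educ positively correlated with ability. Regressing on educ alone
# gives a slope near 2.2: upwardly biased, as + * + = + predicts.
rng = np.random.default_rng(1)

n = 100_000
ability = rng.normal(size=n)
educ = 12 + 2 * ability + rng.normal(size=n)     # cov(educ, ability) = 2
wages = 5 + 1.0 * educ + 3.0 * ability + rng.normal(size=n)

xc = educ - educ.mean()
beta_short = (xc * (wages - wages.mean())).sum() / (xc * xc).sum()
print(beta_short)   # about 1.0 + 3.0 * 2 / 5 = 2.2
```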
36. Omitted variable bias example 3
Suppose we look at how measured IQ relates to being African or not, aka a dummy variable
\[ Africa = \begin{cases} 1, &African \\ 0, &Otherwise \end{cases} \]
we get the regression equation
\[ IQ = \alpha + \beta \, \text{Africa} +u \]
an omitted variable is level of education, which is negatively correlated with Africa. beta is underestimated \(+ \cdot - = -\)
\(\beta^p = 0, \tilde \beta = -10\)
Here is a rule:
- If the product of the two relationships (correlation and effect) is positive:
- The bias is positive (upward bias).
- The estimated coefficient comes out above the true β (pushed in the positive direction).
- If the product of the two relationships is negative:
- The bias is negative (downward bias).
- The estimated coefficient comes out below the true β (pushed in the negative direction).
37. Omitted variable bias proof part 1
wage is affected by education. we estimate the effect from a sample, and omit a variable like ability
ability and education: positive relation,
ability and wage: positive relation
\[ wages = \alpha + \beta_1 \, educ + \beta_2 \, ability + v_i \]
and the omitted ability leads to
\[ wages = \alpha + \beta_1 \, educ +u \]
remember \(\hat \beta\) formula?
\[ \begin{align*} \hat \beta &= \dfrac{cov(y,x)}{var(x)} \\&= \dfrac{cov(\beta x+u,x)}{var(x)} \\&= \dfrac{\beta cov(x,x)}{var(x)} + \dfrac{cov(u,x)}{var(x)} \\&= \beta + \dfrac{cov(u,x)}{var(x)}\\ &= \beta + \dfrac{\sum(educ_i-\bar{ \text{educ}})u_i}{\sum(educ_i-\bar{ \text{educ}})^2} \end{align*} \]
38. Omitted variable bias proof part 2
Continuing from the last expression, we substitute \(u_i\) with \(\beta_2 \, \text{ability} + v_i\), then take the expectation
\[ \begin{align*} \hat \beta_1 &= \beta^P_1 + \dfrac{\sum(educ_i-\bar{ \text{educ}})u_i}{\sum(educ_i-\bar{ \text{educ}})^2}\\ &= \beta_1^P + \dfrac{\sum(educ_i-\bar{ \text{educ}})(\beta_2 \, abil_i + v_i)}{S^2_{educ}} \end{align*} \]
Taking expectation
\[ E[\hat \beta_1] = \beta_1^P + \dfrac{ \beta_2\sum E[(educ_i-\bar{ \text{educ}})abil]}{S^2_{educ}} \]
the \(v_i\) term drops out cuz we assume it is orthogonal to (uncorrelated with) education
since the education-ability relation is positive, and \(\beta_2\) is also positive, this can be written as
\[ \boxed{E[\hat \beta_1] = \beta_1^P + \beta_2 \, \dfrac{\sum E[(educ_i-\bar{ \text{educ}})\,abil_i]}{S^2_{educ}}} \]
where the second term is positive
estimator is upward biased
\[ E[\hat \beta_1]> \beta_1^P \]
39. Reverse causality part 1
If zero conditional mean of errors is not met, we have endogeneity aka x is endogenous <there is a relation between x and \(u\)>, beta is biased.
\[ E[u_i|x_i] \neq 0 \]
The econometrician Eli Berman investigated the Iraq war
He was interested in the level of violence with respect to development,
\[ violence = \alpha + \beta \, development + u_i \]
the relation appears to be positive, which is wrong:
areas with high violence get more development spending.
\[ development = \alpha + \beta \, violence + u_i \]
so if the first model gives \(\hat \beta = 5\%\), the population value can be \(\beta = -5\%\), aka we have reverse causality
Solution: use instrumental variable
40. Reverse causality part 2
why does reverse causality lead to \(E[u_i|x_i] \neq 0\)?
The Human Development Index (HDI) is a composite statistic used to rank countries based on their overall achievements in health, education, and income.
we expect civil war to decrease HDI, but having low HDI will cause civil war too
\[ \begin{align*} HDI &= \alpha + \beta CW + u\\ CW &= \delta + \gamma HDI + v \end{align*} \]
to see the problem, state the conditional expectation as covariance of \(u_i,CW\) where \(CW\) is a dummy variable
\[ \begin{align*} cov(u,cw)&= cov(u, \delta+\gamma HDI+v) \\&= \gamma cov(u, HDI)\\&= \gamma cov(u,\alpha + \beta CW+ u)\\&= \dots + \gamma var(u) \\&\neq 0 \end{align*} \]
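The simultaneity can be simulated; here is a numpy sketch under illustrative assumptions (CW treated as continuous for simplicity, true \(\beta=-1\), feedback \(\gamma=-0.5\)):

```python
import numpy as np

# Reverse-causality sketch: HDI = beta*CW + u and CW = gamma*HDI + v.
# Substituting one equation into the other gives the reduced form below.
# OLS of HDI on CW cannot recover beta because CW is correlated with u.
rng = np.random.default_rng(2)

n = 200_000
beta, gamma = -1.0, -0.5
u = rng.normal(size=n)
v = rng.normal(size=n)

hdi = (beta * v + u) / (1 - beta * gamma)   # reduced form for HDI
cw = gamma * hdi + v

xc = cw - cw.mean()
slope = (xc * (hdi - hdi.mean())).sum() / (xc * xc).sum()
print(slope)   # about -1.2, not the true -1.0
```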
41. Measurement error in independent variable part 1
Endogeneity means \(E[u_i|x_i] \neq 0\)
we can get endogeneity by measurement error
Example:
over time, we measure company advertising and sales
There is reverse causality but ignore it for now, and true \(\beta\) is 10
\[ sales_t = \alpha + \beta A_t + v \]
But what if we measure the level of advertising with error?
we observe the measure M = advertising + error; \(\hat\beta\) becomes weaker and comes out at 7, and if the error increases, \(\hat\beta\) can approach zero
As the measurement error increases, the (attenuation) bias increases
42. Measurement error in independent variable part 2
why does measurement error violate the zero conditional mean assumption?
back to sales function of advertising, we measure advertising as \(M\) = A + error cuz we don’t have exact data of advertising
substitute M in the equation to take covariance
\[ \begin{align*} S_t &= \alpha + \beta A_t + u_t\\ M_t &= A_t + v_t\\ S_t &= \alpha + \beta(M_t - v_t)+u_t\\ S_t &= \alpha + \beta M_t + (u_t-\beta v_t)\\ cov(u_t-\beta v_t,M_t) &= - \beta \, cov(v_t,A_t+v_t)\\ &= - \beta \, cov(v_t,v_t)\\ &= -\beta \sigma^2_v\\ &\neq 0 \end{align*} \]
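A numpy sketch of the attenuation (all numbers are illustrative assumptions): as the noise in the measured regressor grows, the slope shrinks from the true 10 toward zero by the factor \(var(A)/(var(A)+\sigma^2_v)\).

```python
import numpy as np

# Measurement-error sketch: sales = 3 + 10*A + u, but we observe
# M = A + v. The OLS slope on M is 10 * var(A) / (var(A) + var(v)).
rng = np.random.default_rng(3)

n = 200_000
a = rng.normal(size=n)                       # true advertising, var = 1
sales = 3 + 10 * a + rng.normal(size=n)

slopes = []
for noise_sd in (0.0, 0.5, 2.0):
    m = a + noise_sd * rng.normal(size=n)    # mismeasured regressor
    xc = m - m.mean()
    slopes.append((xc * (sales - sales.mean())).sum() / (xc * xc).sum())

print(slopes)   # about [10, 8, 2]: more noise, more attenuation
```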
43. Functional misspecification 1
people start with low salaries, earn the most in their forties and fifties, then wages decline, so there is an inverted \(U\) shape, and maybe the relation is
\[ \text{wages} = \alpha + \beta_1 \text{age} + \beta_2 \text{age}^2 + u \]
\[ \beta_1 > 0 \qquad \beta_2 <0 \]
if we plot it, it is an inverted \(U\) shape, so fitting a straight line is a bad idea; a straight line means we did not add \(\text{age}^2\), so to make the line curvy, we add it
Note: functional misspecification is a form of omitted variable bias
Note2: there is no perfect collinearity, cuz they don’t have linear relationship
44. Functional misspecification 2
How car sales are affected by price?
\[ CS = AP^\beta e^u \]
beta here is less than zero, indicating that
\[ p \to \infty \quad cs \to 0 \]
This indicates that when the price is low, a small change produces a big change in demand, and when the car has a high price, a small change has almost no effect
If we have the dataset thrown in a scatterplot, and we fit a straight line, we will be biased due to functional misspecification
Notice that the CS function is not linear in parameters, so to estimate it, we take logs
\[ \ln CS = \ln A + \beta \ln P + u \]
\(\beta\) here represents elasticity, percentage change in demand with respect to percentage change in price
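A numpy sketch of this transformation (A, \(\beta\), and the price range are illustrative assumptions): simulate the multiplicative model, take logs, and OLS recovers the elasticity.

```python
import numpy as np

# Log-log sketch: CS = A * P**beta * exp(u) is nonlinear in A and beta,
# but ln CS = ln A + beta * ln P + u is linear in parameters.
rng = np.random.default_rng(4)

n = 100_000
A, beta = 50.0, -1.5
p = rng.uniform(1.0, 20.0, size=n)
cs = A * p**beta * np.exp(0.3 * rng.normal(size=n))

x, y = np.log(p), np.log(cs)
xc = x - x.mean()
beta_hat = (xc * (y - y.mean())).sum() / (xc * xc).sum()
print(beta_hat)   # close to the true elasticity -1.5
```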
45. Linearity in parameters
One of the assumptions is linearity in parameters. All of the following are linear in parameters even if they are not linear in the independent variable
\[ y = \alpha + \beta x\\ y = \alpha + \beta x^2\\ \ln y = \alpha + \beta \ln x \]
A model that will violate this assumption is
\[ y =( \alpha+x)^2 \]
cuz when we expand it, we get
\[ y = \alpha^2 + 2\alpha x + x^2 \]
46. Random sample summary
Mathematical definition:
If we have many random variables \(\{y_1,\dots,y_n\}\) that are independent and have a common pdf, then this is a random sample
Example:
\[ \text{wages} = \alpha + \beta \text{educ}+u \]
we choose individuals who have wage and education; choosing a person with high education doesn't change the probability of picking any other individual
Notice:
If we have two populations where population A is represented as
\[ \text{wages} = \alpha + \beta \text{educ}+u \]
and Population B is represented as
\[ \text{wages} = \alpha + \beta_1 \text{educ}+ \beta_2 \text{ability}+u \]
I can’t take a random sample by selecting individuals from both populations
Another violation will be if we focus on one area instead of the whole population
47. Explanation of random sampling and serial correlation
If we have random sample then we have no serial correlation,
so why state the two assumptions and not just one?
cuz time series data are not a random sample, so we will be dealing with non-random samples and serial correlation later on.
Proof of random sampling means no serial correlation:
We have a population with a function
\[ y = \alpha + \beta x + u \]
we get N individuals, who are \(iid\), we write the independence mathematically as
\[ E[y_iy_j] = E[y_i]E[y_j] \quad \forall \, i \neq j \]
Using independence, we deduce that
\[ cov(y_i,y_j) = E[y_iy_j]-E[y_i]E[y_j] = 0 \]
If they are independent, then covariance is zero
we can expand the covariance, but take into account that \(x,u\) are uncorrelated by zero conditional mean assumption \(E[u_i|x_i]=0\)
lets expand
\[ \begin{align*} cov(\alpha +\beta x_i+u_i, \alpha + \beta x_j + u_j )\\= \beta^2cov(x_i,x_j)+ cov(u_i,u_j)\\= 0 \end{align*} \]
48. Serial correlation summary
serial correlation means
\[ cov(u_i,u_s)\neq 0 \quad i\neq s \]
aka there is a relationship between the error terms
If we have serial correlation, OLS is no longer best, There are other linear unbiased estimators with smaller sampling variance
Ways to get serial correlation:
- omit an important variable
- it's a systematic error in the model, not the population
- functional misspecification
- still not true serial correlation; we used the wrong model, so we have a systematic error, which will make the model biased
- measurement error in the independent variable
- will make the model biased
Note: GLS and FGLS can produce BLUE estimators, like Cochrane-Orcutt, Prais-Winsten
49. Serial correlation as a symptom of omitted variable bias
If we model farmers income over time, There is a pattern of ups and downs, our model that we fit is a straight line, so we get runs of ups and downs.
Why?
cuz we did not include an important variable: amount of rain which also has ups and downs that explain the ups and downs in the income
Another example:
sales as a function of advertising and seasonality
our model will capture some ups and downs through seasonality but may not explain them fully
why?
cuz we dropped an important factor: price
50. Serial correlation as a symptom of functional misspecification
wage as a function of age,
we start with few wage then it grows until we get so old, wage decreases again.
so there is an inverted \(U\) shape, a straight line is not representative and we get runs of ups and downs
Remember: serial correlation shows up as runs of ups and downs
To solve the problem, just add \(\text{age}^2\)
51. Serial correlation caused by measurement error
Farmer income with respect to time and rain,
we know amount of rain using a weather station but it is not accurate, it can’t capture the high levels of rain (censored)
Our model will underpredict the income and get serial correlation.
52. Serial correlation biased standard error part 1
Serial correlation can be caused by clustering
Example
\[ TS_{ig} = \alpha + \beta CS_g + \epsilon_{ig} \]
How test score is affected by classroom size, we know that \(\beta<0\) (as size increases, teacher focus per student decreases), and we know that there is error term
The error term consists of two errors, difference in classes <teachers, books etc.> and difference in students
\[ \epsilon_{ig} = v_g + \eta_i \]
\(v_g\) is shared by all students in a class with one teacher, aka clustering, so two individuals \(i,j\) in the same class will be correlated
\[ cov(\epsilon_{i},\epsilon_j)= \sigma^2_v \]
this shared variance is due to the teacher
53. Serial correlation biased standard error part 2
Back to our test score problem: we said the error term consists of two errors, based on student and teacher variability
\[ TS_{ig} = \alpha + \beta CS_g + \epsilon_{ig} \]
If we model our data, we get \(\beta =-10\) and we get standard error \(SE=1.5\) which we will use later on for inference
But the problem is: programs assume GM assumptions are met, but we have serial correlation, so the true standard error may be \(SE = 5\). This big difference will make all the inference wrong
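This gap can be reproduced with a small Monte Carlo in numpy (group sizes, variances, and coefficients are illustrative assumptions): when errors share a class-level component, the i.i.d. formula understates the real sampling variation of \(\hat\beta\).

```python
import numpy as np

# Clustering sketch: each class g has a shared error v_g on top of the
# student-level error, and the regressor is constant within a class.
# Compare the true spread of beta-hat with the naive i.i.d. SE formula.
rng = np.random.default_rng(5)

G, n_per, reps = 40, 25, 500
betas, naive_ses = [], []
for _ in range(reps):
    cs = rng.normal(size=G)                        # class-level regressor
    x = np.repeat(cs, n_per)
    u = np.repeat(rng.normal(size=G), n_per) + rng.normal(size=G * n_per)
    y = 2.0 - 1.0 * x + u

    xc = x - x.mean()
    sxx = (xc * xc).sum()
    b = (xc * (y - y.mean())).sum() / sxx
    resid = (y - y.mean()) - b * xc
    s2 = (resid * resid).sum() / (len(y) - 2)      # naive sigma^2 estimate
    betas.append(b)
    naive_ses.append(np.sqrt(s2 / sxx))

print(np.std(betas), np.mean(naive_ses))   # true SE is several times larger
```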
54. Heteroskedasticity summary
by homoscedasticity, we assume variance of the error term is constant
\[ var(u_i|X_i) = \sigma^2 \]
Visually, we can fit the data between two parallel lines
If this assumption is not met, we have heteroskedasticity, aka the variance of the error term increases or decreases as x changes
\[ var(u_i|X_i) = f(x_i) \]
why bother?
if its violated, estimators are no longer best
Why?
cuz model is lacking information that explains the change in variance
55. Heteroskedastic errors example 1
How wage depends on education?
\[ wage = \alpha + \beta \, educ + u \]
we expect a linear relationship and \(\beta>0\).
But as education increases, job options widen: a person can be a rich banker or a poor academic
Fitting a straight line will show that the error variance increases with education
\(\beta\) is unbiased cuz it can predict the average case, our problem is with the variance
\[ Var(u_i|educ_i) = \sigma^2 educ \]
However
\[ E[u_i|educ_i]=0 \]
Other estimators are WLS, GLS
56. Heteroskedastic errors example 2
we see percentage spending on food with respect to income
\[ PSF = \alpha + \beta \, inc + u \]
As an individual gets richer, the spread widens: they can be a foodie or dislike expensive food.
However, a poor person spends nearly all income on food. The richer you are, the more choices you have
57. Heteroskedasticity caused by data aggregation
aggregation means grouping
We see how test score is affected by parental income
\[ TS = \alpha + \beta \, pinc + u_i \]
if parents are rich, student has more computers and books, we expect positive relationship \(\beta>0\).
We plot the data for \(10,000\) individuals. But there is a lot of variability in the data, so we group individuals into 500 groups to decrease it
\[ \bar{TS} = \alpha + \beta \, \bar{pinc} + \bar u_g \]
No free lunch in econometrics
\(\beta\) is no longer efficient
Heteroskedasticity occurs cuz error is defined as arithmetic mean of each group. For example, groups \(g,f\)
\[ \bar u_g = \dfrac {1} {n_g} (u_{1g} + u_{2g}+\dots + u_{ng}) \]
\[ \bar u_f = \dfrac {1} {n_f} (u_{1f} + u_{2f}+\dots + u_{nf}) \]
so error depends on group and each group has its own variance
\[ var(\bar u_g)\neq var(\bar u_f) \]
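The changing variance is easy to verify in numpy (group sizes and \(\sigma\) are illustrative assumptions): averaging i.i.d. errors over groups of different sizes gives group errors with variance \(\sigma^2/n_g\).

```python
import numpy as np

# Aggregation sketch: the group-mean error has variance sigma^2 / n_g,
# so groups of different sizes give heteroskedastic aggregated errors.
rng = np.random.default_rng(6)

sigma, reps = 2.0, 50_000
group_vars = {}
for n_g in (5, 20, 80):
    u_bar = rng.normal(0, sigma, size=(reps, n_g)).mean(axis=1)
    group_vars[n_g] = u_bar.var()

print(group_vars)   # about {5: 0.8, 20: 0.2, 80: 0.05} = sigma**2 / n_g
```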
58. Perfect collinearity example 1
Finally, the last assumption. no perfect collinearity.
We can’t add square foot and square meter in the same model cuz they measure the same thing
\[ SQF = g*SQM \]
\(\beta\) will not have a value, any program will return an error message
why?
remember that regression is just partial differentiation: the partial effect of a variable while holding the other factors fixed. If two variables measure the same thing, holding one fixed while varying the other is impossible
Another example:
consumption with respect to non labor income, salary, total income
\[ C_i = \alpha + \beta_1 \, NL + \beta_2 \, salary + \beta_3 \, TI + u \]
Notice that non labor income + salary = total income aka exact relationship
\[ TI = NL + salary \]
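The breakdown can be seen directly in numpy (data values are arbitrary): with TI = NL + salary, the design matrix loses a rank and \(X'X\) cannot be inverted, which is exactly why software refuses to report a coefficient.

```python
import numpy as np

# Perfect-collinearity sketch: total income is exactly NL + salary, so
# one column of X is a linear combination of the others.
rng = np.random.default_rng(7)

n = 100
nl = rng.normal(size=n)
salary = rng.normal(size=n)
ti = nl + salary                        # exact linear relationship

X = np.column_stack([np.ones(n), nl, salary, ti])
rank = np.linalg.matrix_rank(X)
cond = np.linalg.cond(X.T @ X)
print(rank, cond)   # rank 3 (not 4) and an astronomical condition number
```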
59. Perfect collinearity example 2
we see test score with respect to size of home and distance from country center
\[ TS = \alpha + \beta_1 \, size + \beta_2 \, country + u_i \]
both are dummy variables: size takes 1 if the home has more than 2 bedrooms, 0 otherwise,
country takes 1 if the home is close to the center, 0 otherwise
If we select rich people only, country and size always equal 1, so the dummy variables cause perfect collinearity (each is identical to the constant term)
60. Multicollinearity
multicollinearity means that \(x_1,x_2\) have a relationship but not a perfect one.
Example:
sales as a function of Tv advertising and radio advertising
\[ Sales = \alpha + \beta_1 \, TV + \beta_2 \, Radio + u \]
We used Tv and radio at the same time, but spent more on TV
If we check the relation between TV and Radio, we find a positive relationship between them. Regression will not be able to distinguish effect of each variable
Result of multicollinearity:
\(R^2\) is big, like 0.9, but individually each variable is insignificant, and the betas have huge standard errors
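A numpy sketch of the standard-error inflation (sample size, \(\sigma^2\), and the 0.95 correlation are illustrative assumptions), using the formula \(var(\hat\beta) = \sigma^2 (X'X)^{-1}\):

```python
import numpy as np

# Multicollinearity sketch: as the TV-radio correlation rises, the
# standard error of each slope blows up even though n is unchanged.
rng = np.random.default_rng(8)

n, sigma2 = 500, 1.0
ses = {}
for r in (0.0, 0.95):
    tv = rng.normal(size=n)
    radio = r * tv + np.sqrt(1 - r**2) * rng.normal(size=n)
    X = np.column_stack([np.ones(n), tv, radio])
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)
    ses[r] = np.sqrt(cov_beta[1, 1])     # SE of the TV coefficient

print(ses)   # the SE at r = 0.95 is roughly 3x the SE at r = 0
```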
61. Index- where we currently are
We use OLS on a sample to estimate the population parameters. OLS estimators are BLUE when the Gauss-Markov assumptions are met
We do diagnostic tests to see if the assumptions are met or not
If one or more of the assumptions are not met, we use other ways like GLS,ML,IV
Next: proof of Gauss Markov assumptions
62. Gauss-Markov proof part 1
We will prove that if the assumptions are met, the estimator is BLUE
The equation is
\[ y =\alpha + \beta x + u \]
And we know that \(\hat \beta\) is
\[ \hat \beta = \dfrac{\sum(x_i-\bar x)y_i}{\sum(x_i-\bar x)^2} = \sum v_iy_i \]
For simplification, we write the weight inside the sum as \(v_i\)
Notice that \(\hat \beta\) is a linear function of the \(y_i\): this is the "linear" in BLUE
\[ \boxed{v_i =\dfrac{x_i-\bar x}{s_{xx}}} \]
And it has the following properties
\[ \boxed{\sum v_i = 0} \]
cuz of the numerator, when we sum it we get
\[ \sum(x_i - \bar x) = n \bar x - n \bar x = 0 \]
and if we sum the squared \(v_i\), we get the sum of squared deviations over \(s_{xx}^2\)
\[ \boxed{\sum v_i^2 = \dfrac{\sum(x_i - \bar x)^2}{s^2_{xx}} = \dfrac{1}{s_{xx}}>0} \]
63. Gauss-Markov proof part 2
- we will prove first that \(\hat \beta\) is unbiased using zero conditional mean assumption.
- Then we get variance and assume no serial correlation, homoscedasticity
- Finally, we make a new beta \(\tilde \beta\) and prove that its unbiased and get its variance, and prove that \(\hat \beta\) is efficient by showing \(var(\tilde \beta) \ge var(\hat \beta)\)
Lets do the unbiased thing
\[ \begin{align*} \hat \beta &= \sum v_iy_i\\ &= \sum v_i(\alpha + \beta x_i + u_i)\\ &= \alpha \sum v_i + \beta \sum v_i x_i + \sum v_iu_i\\ &= 0 + \beta \, \dfrac{\sum(x_i-\bar x)x_i}{s_{xx}} + \sum v_iu_i\\ &=0 +\beta \, \dfrac{\sum (x_i - \bar x)^2}{\sum (x_i - \bar x)^2} + \sum v_i u_i\\ &= 0 + 1 \cdot \beta + \sum v_i u_i\\ &= \beta + \sum v_i u_i\\ \end{align*} \]
So for \(\hat \beta\) to be unbiased, we need \(\sum v_i u_i\) to disappear in expectation. To do so, we condition on the \(x_i\), so the \(v_i\) are treated as fixed numbers and can be pulled out of the expectation
\[ \begin{align*} E[\hat \beta] &= \beta + \sum E[v_i u_i]\\ &= \beta + \sum v_i E[u_i]\\ &= \beta + 0 \end{align*} \]
Hence we use zero conditional mean assumption
\[ E(u|x_i) = 0 \]
64. Gauss-Markov proof part 3
We proved that \(\hat \beta\) is unbiased using zero conditional mean; now we derive the variance assuming no serial correlation and homoscedasticity
We know the formula for \(\hat \beta\)
\[ \hat \beta_{LS} = \beta + \sum v_iu_i \]
Then we take variance
\[ var(\hat \beta_{LS}) = var(\beta + \sum v_iu_i) \]
Since \(\beta\) is a constant with no variance, it disappears, e.g. the variance of 10 is 0
and we have no serial correlation
\[ var(\hat \beta_{LS}) = var(\sum v_iu_i) = \sum var(v_iu_i) \]
we have no covariance terms due to the no serial correlation assumption.
Finally, we assume homoscedasticity, so the variance of \(u_i\) is a constant \(\sigma^2\), and \(v_i\) is a number so it comes out of the variance squared
\[ \boxed{var(\hat \beta_{LS}) = \sum v_i^2 \, \sigma^2 = \sigma^2 \sum v_i^2 = \dfrac{\sigma^2}{s_{xx}}} \]
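This variance formula can be checked by Monte Carlo in numpy (sample size and \(\sigma\) are illustrative assumptions): hold x fixed, redraw homoscedastic uncorrelated errors many times, and compare the empirical variance of \(\hat\beta\) with \(\sigma^2/s_{xx}\).

```python
import numpy as np

# Monte Carlo check of var(beta_hat) = sigma^2 / s_xx.
rng = np.random.default_rng(9)

n, sigma = 50, 1.5
x = rng.uniform(0, 10, size=n)           # x is held fixed across draws
xc = x - x.mean()
sxx = (xc * xc).sum()

betas = []
for _ in range(20_000):
    y = 1.0 + 2.0 * x + rng.normal(0, sigma, size=n)
    betas.append((xc * (y - y.mean())).sum() / sxx)

print(np.var(betas), sigma**2 / sxx)     # the two numbers agree closely
```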
Next time, we do \(\tilde \beta\) that is unbiased and compare the variance
65. Gauss-Markov proof part 4
Lets do \(\tilde \beta\) to prove that \(\hat \beta\) is more efficient,
let \(b_i\) be some kind of weighting
\[ \begin{align*} \tilde \beta &= \sum b_iy_i\\ &= \sum b_i (\alpha + \beta x_i + u_i)\\ &= \alpha \sum b_i + \beta \sum b_i x_i + \sum b_i u_i \end{align*} \]
Then we take its expectation, last term disappears due to zero conditional mean assumption
\[ \begin{align*} E[\tilde \beta] = \alpha \sum b_i + \beta \sum b_i x_i + 0 \end{align*} \]
To have the expectation \(E [\tilde \beta] = \beta\) we must have these conditions
\[ \sum b_i = 0, \sum b_i x_i = 1 \]
To get
\[ \begin{align*} E[\tilde \beta] = \alpha \cdot 0 + \beta \cdot 1 + 0 = \beta \end{align*} \]
66. Gauss-Markov proof part 5
If we apply the constraints used for \(\tilde \beta\) to be unbiased, we get that
\[ \tilde \beta = \beta + \sum b_iu_i \]
Let's write the weights \(b_i\) as the least squares weights \(v_i\) plus some difference \(c_i\)
\[ b_i = v_i + c_i \]
Then we substitute
\[ \tilde \beta = \beta + \sum(v_i + c_i)u_i \]
then take the variance
\[ var(\tilde \beta) = \sigma^2 \sum(v_i+c_i)^2 \]
Lets expand the terms
\[ var(\tilde \beta) = \sigma^2(\sum v_i^2 + 2 \sum v_ic_i+ \sum c_i^2) \]
using the \(b_i\) constraints that made it unbiased, we get that
\[ \sum(v_i +c_i) = 0 \to \sum c_i=0 \]
and
\[ \sum(v_i +c_i)x_i = \sum v_i x_i + \sum c_ix_i = 1 \to \sum c_ix_i = 0 \]
The above expression has to sum to one for \(\tilde \beta\) to be unbiased and we know from ols that
\[ \sum v_ix_i = 1 \]
67. Gauss-Markov proof part 6
We have variance of \(\tilde \beta\) and conditions, if we expand \(\sum v_ic_i\) we get zero from the two conditions <\(\sum c_i = 0, \sum c_ix_i=0\)>
\[ \sum \dfrac{x_i - \bar x}{s_{xx}}c_i = \dfrac{1}{s_{xx}} \sum x_ic_i - \dfrac{\bar x}{s_{xx}}\sum c_i= 0 \]
Hence the variance is
\[ var(\tilde \beta) = \sigma^2 \sum v_i^2 + \sigma^2 \sum c_i^2 \]
while variance of ols is
\[ var(\hat \beta_{LS}) = \sigma^2 \sum v_i^2 \]
and since \(\sum c_i^2\) is non-negative, we reach
\[ var (\tilde \beta)\ge var(\hat \beta) \]
so \(\hat \beta\) is the best linear unbiased estimator
68. Errors in population vs estimated errors
There is a process that creates the variables in the population. For example, in the population we have
\[ wages = \alpha + \beta_1\, educ + \beta_2 \, ability + u \]
Of course the process is a linear combination here. We expect the error term to be \(iid(0, \sigma^2)\)
Taking the expectation, the error term disappears, cuz people with the same education and ability can be university lecturers or investment bankers, aka below the average or above it.
\[ E[wages|educ, abil] = \alpha + \beta_1 \, educ + \beta_2 \, abil \]
In econometrics, we have a sample not a population, so when we fit OLS, we estimate the error term \(\hat u\) and call it residuals.
\[ wages = \hat\alpha + \hat\beta_1\, educ +\hat u \]
The residuals are not the population errors, cuz our sample regression lacks terms included in the population process
\[ \hat u_i \neq u_i \]
69. Sum of squares
So what is TSS - ESS - RSS?
We have a scatterplot of \(y_i\) and \(x_i\). we can draw a line at the mean \(\bar y\)
\[ TSS = \sum(y_i - \bar y)^2 \]
Then we fit our Least squares line, and we get \(\hat y\) then see the vertical distance between \(\hat y\) and \(\bar y\). This is explained sum of squares ESS. And it means how much of the variation in \(y\) that we can explain
\[ ESS = \sum(\hat y - \bar y)^2 \]
Our model can’t fit through every point perfectly, so \(ESS \le TSS\), the ratio is \(\le 1\) cuz of the remaining part: difference between \(\hat y\) and \(y\)
\[ RSS = \sum(y -\hat y)^2 \]
Intuitively:
\[ TSS = ESS+RSS \]
Remember: the distance between \(y , \bar y\) is divided into distance between
- \(\bar y, \hat y\)
- \(\hat y, y\)
so \(\hat y\) usually lies between them; this is related to regression towards the mean, and \(\hat y = \bar y\) at the point \(\bar x\)
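The decomposition can be verified numerically; here is a numpy sketch with arbitrary simulated data:

```python
import numpy as np

# Check that TSS = ESS + RSS holds exactly for an OLS fit with intercept.
rng = np.random.default_rng(10)

n = 200
x = rng.normal(size=n)
y = 3 + 2 * x + rng.normal(size=n)

xc = x - x.mean()
b = (xc * (y - y.mean())).sum() / (xc * xc).sum()
a = y.mean() - b * x.mean()
y_hat = a + b * x

tss = ((y - y.mean()) ** 2).sum()
ess = ((y_hat - y.mean()) ** 2).sum()
rss = ((y - y_hat) ** 2).sum()
print(tss, ess + rss)   # equal up to floating-point rounding
```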
70. R squared part 1
Back to our scatterplot, we try to explain the deviation around \(\bar y\) by fitting a line, the distance we explain is the distance between \(\bar y, \hat y\) aka called ESS.
The problem with ESS is that it has units squared, if we measure distance in meters, ESS will be 10 meters squared for example and we have no idea if this is big or not.
Solution:
We do a ratio of the explained from total called R squared
\[ R^2 = \dfrac{ESS}{TSS} \qquad 0\le R^2\le 1 \]
Extreme values of \(R^2\)
- \(0\) if we estimate that \(\hat y = \bar y\) aka \(ESS = 0\)
- 1 if line passes through each point \(\hat y = y\) aka \(ESS = TSS\)
Note: we don’t depend on R squared cuz it has serious problems
71. R squared part 2
Why don’t we count on R squared?
If we have the model
\[ y = \alpha + \beta_1 \, x_1 + u \]
And calculate \(R^2 = 0.65\) which is interpreted as:
65% of the variation in \(y\) is explained by the model
Then we do a new model with extra variables
\[ y = \alpha + \beta_1 \, x_1 + \beta_2 \, x_2+u \]
\(R^2\) can only increase, never fall.
If we add more variables, \(R^2\) will approach 1, even if the variables are rubbish.
\[ y = \alpha + \beta_1 \, x_1 + \beta_2 \, x_2 \,+ \dots + \beta_p \, x_p +u \]
R squared can’t be used to compare models. Cuz adding more variables makes the line more wiggly and pass through all the points.
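This mechanical increase is easy to demonstrate in numpy (the data and the 30 junk regressors are illustrative assumptions):

```python
import numpy as np

# Sketch: R^2 never falls when regressors are added, even pure noise.
rng = np.random.default_rng(11)

def r_squared(X, y):
    """R^2 of an OLS fit with intercept, via least squares."""
    Z = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

n = 100
x1 = rng.normal(size=n)
y = 1 + 2 * x1 + rng.normal(size=n)
junk = rng.normal(size=(n, 30))          # 30 pure-noise regressors

r2_small = r_squared(x1[:, None], y)
r2_big = r_squared(np.column_stack([x1[:, None], junk]), y)
print(r2_small, r2_big)                  # r2_big comes out larger
```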
72. Degrees of freedom part 1
Scatter plot again; we know the equation of a line
\[ y = mx+c \]
if I can choose any numbers for \(m,c\), I can draw infinite lines. cuz I have 2 degrees of freedom, I am free to vary two variables \(m,c\) as I like
If I limit it to
\[ y = mx+2 \]
I can draw infinitely many lines that must pass through the intercept = 2; I have 1 degree of freedom (I am free to vary \(m\) only)
If the constraint is on \(m\)
\[ y = 5x+c \]
I can draw infinitely many lines parallel to each other with slope = 5, aka 1 degree of freedom
Finally, If I have constraints on both of them
\[ y = 5x+2 \]
I have zero degrees of freedom, only one line satisfies this equation
73. Degrees of freedom part 2
If we have three observations of \(x\), and we know the mean \(\bar x = 2\)
There are infinitely many ways to get \(\bar x = 2\) using \(x_1,x_2,x_3\). You may think you have three degrees of freedom, but you don't
If we choose \(x_1 =1\) and \(x_2 = 1\), \(x_3\) has to be equal to 4 so we get \(\bar x = 2\). Meaning that we have 2 degrees of freedom
In general, This situation has \(N-1\) degrees of freedom
Why bother?
If we have a sample from population and try to estimate the variance, we may be tempted to say
\[ S^2_N = \dfrac 1 N \sum (x_i- \bar x)^2 \]
but this is a biased estimator, why?
If we assume \(\bar x = 0\), we can rewrite sample variance as
\[ S^2_N = \dfrac 1N (x_1^2 + x^2_2 + \dots + x_N^2) \]
each of the \(x\)'s contributes an expected \(\sigma^2\), except the last term, which is pinned down by \(\bar x = 0\) (it is not free to vary, so it behaves like a constant with variance 0); hence we have \(N-1\) degrees of freedom
\[ E[S_N^2] = \dfrac{N-1}{N} \sigma^2 \]
The sample variance underestimates the population variance; we fix it with Bessel's correction (divide by \(N-1\) instead of \(N\))
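A numpy sketch of the bias and its fix (\(N = 5\) and \(\sigma^2 = 4\) are illustrative assumptions):

```python
import numpy as np

# Bessel sketch: dividing by N underestimates the population variance
# by the factor (N-1)/N; dividing by N-1 removes the bias.
rng = np.random.default_rng(12)

N, sigma2, reps = 5, 4.0, 200_000
samples = rng.normal(0, np.sqrt(sigma2), size=(reps, N))

var_n = samples.var(axis=1, ddof=0).mean()    # divide by N
var_n1 = samples.var(axis=1, ddof=1).mean()   # divide by N-1 (Bessel)
print(var_n, var_n1)   # about 3.2 (= 4 * 4/5) and about 4.0
```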
74. Overfitting in econometrics
If we are trying to explain variations in house prices with respect to square feet, plot it on scatter plot, fit a straight line, then we can predict house price with square feet = 500. This model has for example \(R^2 = 0.55\)
But maybe a quadratic is a better fit than a straight line, with a higher \(R^2 = 0.65\)
If we add millions of variables, we fit every point perfectly,
\[ HP = \alpha + \beta_1 \, SQF + \beta_2 \, SQF^2 + \dots + \beta_p \, SQF^p + u \]
but this model is bad cuz it fits the noise not the signal. This is called overfitting, prediction is all wrong although \(R^2 = 1\). And will do horribly if \(x\) is outside the sample range
Hence, we need Adjusted R Squared \(\bar R^2\)
75. Adjusted R squared
To fix the issues with \(R^2\) we have adjusted R Squared, \(\bar{ R^2}\) which is
\[ \boxed{\bar{R^2} = 1 - \dfrac{(1-R^2)(N-1)}{N-K-1}} \]
What does this mess mean?
\(N:\) number of data points
\(K:\) number of independent variables
\(N > K+1\)
If \(K\) increases, \(N-K-1\) decreases, so ratio gets bigger, and since we subtract by it, the overall value decreases.
\[ K \uparrow,\quad N-K-1 \downarrow,\quad \dfrac{(1-R^2)(N-1)}{N-K-1} \uparrow,\quad \bar R^2 \downarrow \]
Interpretation:
if there is no significant increase in \(R^2\) by adding variables, the \(\bar {R^2}\) falls
Example:
\[ \begin{align*} wages &= \alpha + \beta_1 educ + u\\ wages &= \alpha + \beta_1 educ + \beta_2 \, \text{right-handed} + u \end{align*} \]
We expect \(\bar {R^2}\) to fall in the second model and choose the first model
Extra part:
why \((N-1)\) and \((N-K-1)\)?
when we calculate \(TSS\) we estimate the mean \(\bar y\), so we have \(N-1\) degrees of freedom
and the fitted model estimates \(K+1\) parameters [\(K\) independent variables + 1 intercept \(\alpha\)], so \(RSS\) has the remaining degrees of freedom: \(N -(K+1)= N-K-1\)
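A numpy sketch of the comparison (the junk regressor stands in for right-handedness; all data are illustrative assumptions):

```python
import numpy as np

# Sketch: adding a useless regressor nudges R^2 up but (typically, when
# its |t| < 1) pushes adjusted R^2 down: the lost degree of freedom is
# penalized.
rng = np.random.default_rng(13)

def r2_and_adjusted(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit with intercept."""
    Z = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    k = X.shape[1]                       # number of independent variables
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - k - 1)
    return r2, adj

n = 60
educ = rng.normal(size=n)
wages = 1 + 2 * educ + rng.normal(size=n)
junk = rng.normal(size=n)                # stand-in for a useless variable

r2_1, adj_1 = r2_and_adjusted(educ[:, None], wages)
r2_2, adj_2 = r2_and_adjusted(np.column_stack([educ, junk]), wages)
print(r2_1, adj_1)
print(r2_2, adj_2)   # R^2 weakly rises; adjusted R^2 pays the penalty
```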
76. Unbiasedness of OLS part 1
Note:
this part and the next give a different route to the unbiasedness proof we did in \(62-67\)
If we take multiple samples from the population and estimate \(\beta^p\) in each, we get a sampling distribution of \(\hat \beta\) that is centered on \(\beta^p\), aka unbiased on average, or mathematically
\[ E[\hat \beta] = \beta^p \]
Using OLS, we know that
\[ \hat \beta = \dfrac{\sum (x_i- \bar x)y_i}{\sum(x_i - \bar x)^2} \]
And we can substitute value of \(y_i\)
\[ \hat \beta = \dfrac{\sum (x_i- \bar x)(\alpha^P+\beta^px_i + u_i)}{\sum(x_i - \bar x)^2} \]
Then distribute the numerator term
\[ \hat{\beta} = \dfrac{\alpha^P \sum (x_i - \bar{x}) + \beta^P \sum (x_i - \bar{x}) x_i + \sum (x_i - \bar{x}) u_i}{\sum (x_i - \bar{x})^2} \]
Next: we check first each term in the numerator
77. Unbiasedness of OLS part 2
We reached this form
\[ \hat{\beta} = \dfrac{\alpha^P \sum (x_i - \bar{x}) + \beta^P \sum (x_i - \bar{x}) x_i + \sum (x_i - \bar{x}) u_i}{\sum (x_i - \bar{x})^2} \]
We can further distribute the first term into
\[ \sum(x_i - \bar x)\alpha^p = \sum x_i \alpha^p - \bar x \alpha^p \sum 1 \]
and knowing that \(\bar x = \dfrac{\sum x_i}{N}\)
\[ \sum(x_i - \bar x)\alpha^p = N \bar x \alpha^p - N \bar x \alpha^p = 0 \]
Into second term, we know this magic trick
\[ \dfrac{\sum(x_i - \bar x)x_i}{S_{xx}} = \dfrac{\sum(x_i - \bar x)(x_i - \bar x)}{S_{xx}} = 1 \]
so second term = \(\beta^p\)
SO we get that
\[ \hat \beta = \beta^p + \dfrac{\sum(x_i - \bar x)u_i}{S_{xx}} \]
Taking the expectation (conditional on the \(x\) values):
\[ E[\hat \beta] = \beta^p + \dfrac{\sum E ([x_i - \bar x]u_i)}{S_{xx}} \]
By the zero conditional mean assumption \(E[u_i|x_i] = 0\), the \(x\) terms act as constants inside the conditional expectation, so we can rewrite it as
\[ E[\hat \beta] = \beta^p + \dfrac{\sum [x_i - \bar x]E(u_i)}{S_{xx}} \]
Knowing that \(E(u_i) = 0\), the whole second term disappears and we get \(E[\hat \beta] = \beta^p\)
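The unbiasedness result can be checked with a quick Monte Carlo sketch (all values hypothetical: \(\alpha^p = 2\), \(\beta^p = 0.5\), standard normal errors). Averaging the OLS slope over many samples should recover \(\beta^p\).

```python
import random

random.seed(0)
alpha_p, beta_p = 2.0, 0.5            # hypothetical population parameters
x = [float(i) for i in range(20)]     # fixed regressor values
xbar = sum(x) / len(x)
sxx = sum((xi - xbar) ** 2 for xi in x)

slopes = []
for _ in range(5000):
    # draw a fresh sample: y_i = alpha^p + beta^p x_i + u_i, u_i ~ N(0, 1)
    y = [alpha_p + beta_p * xi + random.gauss(0, 1) for xi in x]
    # OLS slope: sum (x_i - xbar) y_i / S_xx
    slopes.append(sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx)

mean_slope = sum(slopes) / len(slopes)
print(mean_slope)  # close to beta_p = 0.5
```

Each individual \(\hat\beta\) misses \(\beta^p\), but the mean of the sampling distribution sits on top of it, which is exactly what \(E[\hat\beta]=\beta^p\) says.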
78. Variance of OLS <SE> Estimators part 1
If we are interested in test score explained by class size, statistical software may produce the equation
\[ TS = 50 - 10CS \]
where class room size is a dummy variable
\[ CS = \begin{cases}1 & \text{if class size} \ge 30\\ 0 & \text{otherwise} \end{cases} \]
Are these values due to sampling error, or are they really true?
If the true population value is 0 and the sampling distribution is wide, it's easy to get \(-10\) from sampling error.
But if the estimator is efficient (has low sampling variance), we are unlikely to get \(-10\) from sampling error cuz the sampling distribution is narrow
So knowing the variance of the estimator is crucial to decide whether its value comes from sampling error or not
Note: we actually mean standard error which is the standard deviation of the sampling distribution
To calculate the standard error
\[ \hat \beta = \beta ^p + \dfrac{\sum(x_i - \bar x)u_i}{s_{xx}} \]
take the variance, noting that \(\beta^p\) is a constant so it contributes no variance or covariance
\[ var(\hat \beta | x_i) = var\left(\dfrac{\sum(x_i - \bar x)u_i}{S_{xx}}\right) \]
and assuming no serial correlation, there is no covariance terms
\[ \boxed{var(\hat \beta | x_i) = \dfrac{\sum var[(x_i - \bar x)u_i]}{s^2_{xx}}} \]
79. Variance of OLS estimators part 2
Going further, we assume homoscedasticity, \(var(u_i|x_i)=\sigma^2\)
\[ var(\hat \beta| x_i) = \dfrac{\sum (x_i - \bar x)^2 var(u_i)}{s^2_{xx}} \]
which is equal to
\[ \sigma^2 \dfrac{\sum (x_i - \bar x)^2}{(\sum (x_i - \bar x)^2)^2} \]
We can cancel the numerator against one factor of the denominator to get:
\[ \boxed{var(\hat \beta | x_i) = \dfrac{\sigma^2}{\sum (x_i - \bar x)^2}} \]
Of course, we don’t know \(\sigma^2\) so we estimate it
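The boxed formula can be verified by simulation: the empirical variance of \(\hat\beta\) across many samples should match \(\sigma^2/\sum(x_i-\bar x)^2\). A minimal sketch with hypothetical parameter values:

```python
import random, statistics

random.seed(1)
sigma2 = 4.0                          # hypothetical error variance
x = [float(i) for i in range(30)]
xbar = sum(x) / len(x)
sxx = sum((xi - xbar) ** 2 for xi in x)

slopes = []
for _ in range(20000):
    # y_i = 1 + 0.5 x_i + u_i with var(u_i) = sigma2 (hypothetical model)
    y = [1.0 + 0.5 * xi + random.gauss(0, sigma2 ** 0.5) for xi in x]
    slopes.append(sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx)

emp_var = statistics.variance(slopes)  # variance of the sampling distribution
theo_var = sigma2 / sxx                # the boxed formula
print(emp_var, theo_var)               # the two should be close
```

The two numbers agree up to simulation noise, confirming \(var(\hat\beta|x_i) = \sigma^2/S_{xx}\).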
80. Estimator for the population error variance
If we regress \(y\) on \(x\) with error term \(u \sim N(0, \sigma^2)\)
then variance is
\[ Var(u) = E[(u-E(u))^2] = E[u^2]= \sigma^2 \]
Then we get point estimates \(\hat\alpha, \hat\beta\) and the residuals \(\hat u\)
To estimate the population variance
\[ \tilde \sigma ^2 = \dfrac 1 N \sum \hat u_i^2 = \dfrac 1 N [\hat u_1^2 + \hat u_2^2 + \dots + \hat u_{N-1}^2 + \hat u_N^2] \]
and, loosely, the last two terms contribute 0 in expectation, cuz we only have \(N-2\) degrees of freedom due to the two constraints used to estimate \(\alpha, \beta\), which are
\[ \sum \hat u_i = 0, \qquad \sum x_i\hat u_i = 0 \]
So the expectation is
\[ \boxed{E[\tilde \sigma^2] = \dfrac{N-2}{N} \sigma^2} \]
The correct estimator divides by \(N-2\) instead
\[ \boxed{\hat\sigma^2 = \dfrac{1}{N-2}\sum \hat u_i^2, \qquad E[\hat\sigma^2] = \sigma^2} \]
If we have \(k\) regressors plus an intercept, the divisor becomes \(N-k-1\)
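A simulation sketch (hypothetical model \(y = 2 + 0.5x + u\), \(\sigma^2 = 1\), \(N = 10\)) shows the \(\frac 1 N\) estimator averaging to \(\frac{N-2}{N}\sigma^2\) while the \(\frac{1}{N-2}\) version averages to \(\sigma^2\):

```python
import random

random.seed(2)
x = [float(i) for i in range(10)]     # N = 10 data points
N = len(x)
xbar = sum(x) / N
sxx = sum((xi - xbar) ** 2 for xi in x)

naive, corrected = [], []
for _ in range(20000):
    y = [2.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]  # sigma^2 = 1
    b = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx
    a = sum(y) / N - b * xbar
    ssr = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    naive.append(ssr / N)             # tilde sigma^2: divides by N
    corrected.append(ssr / (N - 2))   # hat sigma^2: divides by N - 2

mean_naive = sum(naive) / len(naive)              # about (N-2)/N = 0.8
mean_corrected = sum(corrected) / len(corrected)  # about 1.0
print(mean_naive, mean_corrected)
```

With only 10 observations the downward bias of the \(\frac 1 N\) version is large (20%), which is why small-sample work must use the \(N-2\) divisor.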
81. Estimated variance of OLS estimators - intuition
We reached that
\[ \boxed{Var(\hat \beta|x_i) = \dfrac{\dfrac{1}{N-2}\sum \hat u_i^2}{\sum(x_i - \bar x)^2}} \]
This is the formula used by statistical packages
The denominator measures the variation of \(x\): as \(x\) varies more, the denominator gets bigger and the whole value falls
\[ var(x)\uparrow \;\Rightarrow\; \text{denominator} \uparrow \;\Rightarrow\; var(\hat\beta) \downarrow \]
Why does more variation in \(x\) help?
Imagine a scatterplot where the x are clustered in a tiny area, you are not confident that your line is actually the best. Perhaps you can fit many lines in this small cluster
You get more confident as you have more spread in the \(x\) values
The numerator measures how poorly the model fits the data: bigger residuals \(\hat u_i\) mean a bigger variance for \(\hat \beta\)
82. Variance of OLS estimators in presence of heteroscedasticity
In the normal case
\[ \begin{align*} var(\hat \beta | x_i) &= \dfrac{\sum var[(x_i - \bar x)u_i]}{S_{xx}^2}\\ &= \dfrac{\sigma^2 \sum (x_i - \bar x)^2}{S_{xx}^2} \\ &= \dfrac{\sigma^2}{\sum (x_i - \bar x)^2}\\ &= \dfrac{\sigma^2}{S_{xx}} \end{align*} \]
But what if there is heteroscedasticity?
\[ Var(u_i|x_i) = \sigma^2_i \]
So we estimate \(\sigma^2_i\) as \(\hat u_i^2\) <notice the \(i\)>
\[ \boxed{var(\hat \beta| x_i) = \dfrac{\sum (x_i - \bar x)^2 \hat u_i^2}{s^2_{xx}}} \]
This is called the robust standard error or White standard error
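Both variance estimates can be computed side by side. A sketch on simulated heteroscedastic data, where the error standard deviation is made to grow with \(x_i\) (a hypothetical design, chosen so the two estimates differ):

```python
import random

random.seed(3)
# heteroscedastic design (hypothetical): the error sd equals x_i
n = 500
x = [random.uniform(1, 5) for _ in range(n)]
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
y = [1.0 + 0.5 * xi + random.gauss(0, xi) for xi in x]

# OLS fit and residuals
b = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx
a = sum(y) / n - b * xbar
u = [yi - a - b * xi for xi, yi in zip(x, y)]

sigma2_hat = sum(ui ** 2 for ui in u) / (n - 2)  # pooled sigma^2 estimate
var_conv = sigma2_hat / sxx                      # conventional variance
var_white = sum((xi - xbar) ** 2 * ui ** 2
                for xi, ui in zip(x, u)) / sxx ** 2  # White robust variance
print(var_conv, var_white)
```

The White formula keeps each \(\hat u_i^2\) paired with its own weight \((x_i - \bar x)^2\) instead of pooling the residuals into one \(\hat\sigma^2\); in this design the noisiest observations also carry the largest weights, so the robust variance tends to come out larger than the conventional one.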
83. Variance of OLS estimators in presence of serial correlation
again, this is the normal case
\[ var(\hat \beta | x_i) = \dfrac{\sigma^2}{\sum (x_i - \bar x)^2} \]
If we have serial correlation
\[ cov(u_t, u_{t+j}) = \rho^j \sigma^2 \]
which is \(AR(1)\) process (more on this later)
so variance becomes <assuming \(\bar x = 0\)>
\[ var(\hat \beta|x_t) = var\left(\dfrac{\sum(x_t - \bar x)u_t}{S_{xx}}\right) \]
We can expand it to get
\[ \dfrac{\sigma^2}{S_{xx}}+ \dfrac{2 \sum_{t=1}^{T-1} \sum_{j=1}^{T-t} x_t x_{t+j} \, E[u_tu_{t+j}]}{S_{xx}^2} \]
the \(E\) part is the covariance term cuz \(E[u_t] = 0\). Finally, we replace the covariance with \(\sigma^2 \rho^j\)
\[ \dfrac{\sigma^2}{S_{xx}}+ \dfrac{2\sigma^2 \sum_{t=1}^{T-1} \sum_{j=1}^{T-t} x_t x_{t+j} \rho^j}{S_{xx}^2} \]
What does the second term mean?
if we plot the errors over time, we see runs of positives and negatives; with \(\rho > 0\) the covariance terms \(E[u_tu_{t+j}]\) are positive, so the whole second term is \(>0\)
Summary:
if we use the normal standard errors without taking the covariance term into account, we will be underestimating the true standard error
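The underestimation shows up clearly in a simulation sketch with AR(1) errors and a trending regressor (hypothetical values \(\rho = 0.8\), \(\sigma = 1\), \(T = 50\)): the naive formula \(\sigma_u^2/S_{xx}\) comes out far below the true sampling variance of \(\hat\beta\).

```python
import random, statistics

random.seed(4)
rho, sigma = 0.8, 1.0
T = 50
x = [float(t) for t in range(T)]      # trending regressor (hypothetical)
xbar = sum(x) / T
sxx = sum((xt - xbar) ** 2 for xt in x)
stat_sd = (sigma ** 2 / (1 - rho ** 2)) ** 0.5  # stationary sd of AR(1) errors

slopes = []
for _ in range(5000):
    u = []
    e = random.gauss(0, stat_sd)      # start from the stationary distribution
    for _ in range(T):
        u.append(e)
        e = rho * e + random.gauss(0, sigma)  # u_t = rho u_{t-1} + eps_t
    y = [1.0 + 0.5 * xt + ut for xt, ut in zip(x, u)]
    slopes.append(sum((xt - xbar) * yt for xt, yt in zip(x, y)) / sxx)

true_var = statistics.variance(slopes)  # actual sampling variance of beta-hat
naive_var = stat_sd ** 2 / sxx          # what the usual formula reports
print(true_var, naive_var)              # true variance is much larger
```

Because the regressor moves slowly and \(\rho > 0\), the positive covariance term piles on top of \(\sigma^2/S_{xx}\), so the naive standard error is badly overconfident.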
84. Gauss Markov conditions summary of problems of violation
| Assumption | Bias | Inefficiency | Other |
|---|---|---|---|
| Zero conditional mean of errors | ✅ | | |
| No perfect collinearity | | | can't estimate |
| No serial correlation of errors | ✅ in case of lagged dependent variable | ✅ | can't rely on normal SE |
| Homoscedastic errors | | ✅ | can't rely on normal SE |
85. Estimating population variance from a sample part 1 - Bessel correction
If we have a population of \(x_i\), with a normal error term
\[ x_i = \mu + \epsilon_i \]
then we take a sample and estimate \(\mu \text{ by }\bar x\) which is an unbiased estimator
\[ E[\bar x] = \dfrac 1 N \sum E[x_i]= \dfrac 1 N N \mu = \mu \]
as for \(\sigma^2\), we know from degrees of freedom part that
\[ \tilde \sigma^2 = \dfrac 1 N \sum (x_i - \bar x)^2 = \dfrac 1 N [(x_1 - \bar x)^2 + \dots + (x_N - \bar x)^2] \]
and by taking expectations
\[ E[\tilde \sigma^2] = \dfrac 1 N [\sigma^2 + \sigma^2 + \dots + 0] \]
cuz we used one degree of freedom to estimate \(\bar x\)
\[ E[\tilde \sigma^2] = \dfrac{N-1}{N} \sigma^2 \neq \sigma^2 \]
which is biased but consistent: as \(N\) goes to infinity, \(\dfrac{N-1}{N} \to 1\), so the bias vanishes
The unbiased estimator is
\[ \hat \sigma^2 = \dfrac{N}{N-1} \tilde \sigma^2 \]
cuz
\[ E[\hat \sigma^2] =\dfrac {N}{N-1} \left [ \dfrac {N-1} {N}\sigma^2 \right] = \sigma ^2 \]
This is called Bessel’s correction
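Bessel's correction can be seen in a quick simulation sketch (hypothetical \(\mu = 0\), \(\sigma^2 = 1\), samples of size 5):

```python
import random

random.seed(5)
n = 5                       # small samples make the bias easy to see
biased, unbiased = [], []
for _ in range(40000):
    xs = [random.gauss(0, 1) for _ in range(n)]   # true sigma^2 = 1
    xbar = sum(xs) / n
    ssd = sum((xi - xbar) ** 2 for xi in xs)
    biased.append(ssd / n)          # tilde sigma^2: divides by N
    unbiased.append(ssd / (n - 1))  # Bessel-corrected hat sigma^2

mean_biased = sum(biased) / len(biased)        # about (n-1)/n = 0.8
mean_unbiased = sum(unbiased) / len(unbiased)  # about 1.0
print(mean_biased, mean_unbiased)
```

The \(\frac 1 N\) version averages to \(\frac{N-1}{N}\sigma^2\) exactly as the algebra predicts, while dividing by \(N-1\) removes the bias.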
86. Estimating population variance from a sample part 2
Another way to write the unbiased estimator is
\[ \hat \sigma^2 = \dfrac{1}{N-1}\sum(x_i - \bar x)^2 \]
think of the sample mean as a least squares estimator. Let the cost function \(C\) be
\[ C = \sum(x_i - \tilde x)^2 \]
To minimize the sum, we differentiate
\[ \dfrac{\partial C}{\partial \tilde x} = -2 \sum (x_i - \tilde x) = 0 \]
so to be minimal, \(\tilde x\) has to be \(\bar x\)
\[ \tilde x = \bar x = \dfrac{1}{N} \sum x_i \]
In general \(\bar x \neq \mu\), and since \(\bar x\) minimizes the squared deviations better than any other value, it's always the case that
\[ \sum (x_i - \bar x)^2 \le \sum (x_i - \mu)^2 \]
If we calculated \(\tilde \sigma^2\) using \(\mu\), it would be unbiased
But using \(\bar x\) instead, its expectation is less than the true population variance
Extra: Example
\[ \begin{align*} \text{population } \{2,4,6\}: \quad \mu &= \dfrac{2+4+6}{3} = 4\\ \text{sample } \{2,4\}: \quad \bar x &= \dfrac{2+4}{2} = 3\\ \sum(x_i - \bar x)^2 &= (2-3)^2+(4-3)^2 = 2\\ \sum(x_i - \mu)^2 &= (2-4)^2+(4-4)^2 = 4 \end{align*} \]
For the sample, the sum of squared deviations around \(\bar x\) (= 2) is \(\le\) the sum around \(\mu\) (= 4)
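The arithmetic in the example can be checked directly:

```python
# worked example: population {2, 4, 6}, sample {2, 4}
mu = (2 + 4 + 6) / 3          # population mean = 4.0
xbar = (2 + 4) / 2            # sample mean = 3.0

ssd_xbar = (2 - xbar) ** 2 + (4 - xbar) ** 2  # deviations around xbar = 2.0
ssd_mu   = (2 - mu) ** 2 + (4 - mu) ** 2      # deviations around mu = 4.0
print(ssd_xbar, ssd_mu)       # 2.0 4.0
```

The sample mean soaks up part of the spread, which is exactly the shortfall Bessel's correction compensates for.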
87. Problem set 2
Practical: factors affecting NBA wages
Theory: issues of heteroscedasticity, multicollinearity, endogeneity