Deriving OLS

https://ben-lambert.com/econometrics-course-problem-sets-and-data/

1. Undergraduate econometrics syllabus

We have a population whose whole dataset we can't observe, so we take a sample and use statistical tools on the sample to estimate population quantities

Example:

\(\beta\) is average effect of one year of education on wage

\[ \text{wage} = \alpha + \beta\, \text{educ} + u \]

We don’t have the whole population, so we estimate \(\beta\)

Roadmap

  1. cross sectional data: Microeconomics
    1. OLS → BLUE
    2. Diagnostic tests for G-M assumptions
    3. When assumptions fail: IV, GLS, ML
  2. Time series data: Macroeconomics
    1. stationarity
    2. AR(1) + MA(1)
  3. Panel data
    1. data generating process DGP
    2. between estimator
    3. fixed effects and random effects

2. What is econometrics?

Econometrics is a statistical toolset for evaluating relationships

Example: How does education affect wages? If education increases, wages increase

\[ \text{educ}\uparrow \to \text{wage}\uparrow \]

This is not a causal relationship. We plot the data and fit a line through the points to capture the average relationship. This is microeconometrics

Are sales affected by TV advertising? We plot sales against advertising over time. This is time series data, but still microeconometrics, because the data focus on one firm

How does the interest rate affect the inflation rate? This is macroeconometrics

All the data above have one thing in common:

  • we have a population / data generating process (DGP); within this population there is a relationship, like how wage is related to education, and other factors (like where an individual lives) are collected in an error term

    \[ \text{wage} = \alpha + \beta \text{educ} + u \]

  • We don’t have the whole data, so we take a sample

  • we use statistics on the sample to estimate population quantities

  • we get sampling error because the sample differs from the population

  • we use tools to reduce the sampling error as much as possible

3. Econometrics vs hard science

In science we check whether \(A \to B\): does A affect B?

We run an experiment with test tubes that are exact replicas except for one item, A; if A causes B, then it is a causal relationship

EX: two identical tubes; we add a substance to one and it generates \(CO_2\). We can say that the substance caused the carbon dioxide

In econometrics, we deal with questions like: the effect of military participation on lifetime income

To test that in an experimental way

  1. get two identical twins
  2. one joins the military, the other stays a civilian
  3. we track them and observe their lifetime incomes
  4. we repeat this experiment with multiple pairs of twins

Since this is not feasible, we do it another way

Non experimental data

  1. get the lifetime income of many individuals
  2. we know whether each person went to the military or not
  3. we plot the data for both groups: circles for civilians, crosses for military
  4. we get the average lifetime income for both groups
  5. the difference in averages is not causal, because there are other factors
  6. an example of another factor: those who join the military may already have lower lifetime income potential (they don't plan to be entrepreneurs, say), which is a reverse causal relationship
  7. So the difference comes from two sources, not just one: the causal effect and the reverse causal effect

Good luck handling the subject

4. Natural Experiments

\[ LI = f(M,LIP) \]

Lifetime income (LI) is a function of military service (M) and lifetime income potential (LIP)

Unfortunately, we have selection bias, because those who enter the military are not randomly selected

A clever solution by Joshua Angrist 1990:

In the Vietnam war draft, each birthday was assigned a number from 1 to 366; numbers were then drawn at random, and men whose birthday's number was drawn had to enter the military.

Since entering the military became (as good as) random, we avoid selection bias and can measure the relationship. This is an example of an instrumental variable (IV)

5. Populations and samples

Population can be anything like:

  • people under 18 in UK
  • people who smoke in UK

The important thing is that it includes all the characteristics we care about

Each individual's wage is affected by education in their own way

\[ \text{wage}_i = \alpha + \beta_i\, \text{educ}_i \]

We can aggregate over individuals to get the population process representing the average effect

\[ \text{wage} = \alpha + \beta \text{educ} \]

\(\beta\) here is the average effect of one year of education on wage

sample

we don't have the whole population, so we take some individuals, call them a sample, and fit the same relationship to them

\[ \text{wage} = \alpha^s + \beta^s \text{educ} \]

Notice that \(\beta\) and \(\beta^s\) are generally not equal, due to sampling error

Econometrics deals with how to estimate the population parameter while minimizing sampling error

6. Estimators the basics

back to the wage-by-education example: \(\beta\) is called a parameter because it belongs to the population; we estimate it using the sample estimator \(\hat{\beta}\). The estimator is a function (a rule); the value it takes on a particular sample is called a point estimate \(\beta^*\).

Next: we will focus on the properties that make an estimator good

7. Estimator properties

An estimator has to have certain properties to represent the parameter well. But first, here is what happens under the hood

we can draw multiple samples from one population, and each sample gives an estimate \(\beta_1^*, \beta_2^*, \dots, \beta_n^*\)

The estimates are not equal to each other, nor to the population parameter, due to sampling error; but across samples they form a distribution (the sampling distribution, often approximately normal)
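The idea of a sampling distribution can be sketched numerically. Here is a hypothetical simulation (NumPy assumed; the parameter values \(\alpha = 100, \beta = 50\) are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_true, beta_true = 100.0, 50.0   # hypothetical population parameters

# draw many samples from the same population; each sample yields one estimate
estimates = []
for _ in range(2000):
    educ = rng.uniform(8, 20, size=100)                                # one sample
    wage = alpha_true + beta_true * educ + rng.normal(0, 50, size=100)
    estimates.append(np.polyfit(educ, wage, 1)[0])                     # slope estimate

estimates = np.array(estimates)
print(estimates.mean())   # clusters around the true beta of 50
print(estimates.std())    # the spread is the sampling error
```

The histogram of `estimates` is the sampling distribution that the properties below refer to.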

The properties we want:

  1. on average, the estimator equals the parameter, i.e. it is unbiased: \(E(\hat{\beta}) = \beta^p\)

  2. as the sample size increases, the estimator's values approach the parameter until the sampling distribution collapses to a spike above \(\beta^p\)

    \(n \to \infty, \hat{\beta} \to \beta^p\) aka consistency

  3. efficiency

8. Unbiasedness and consistency

In the sampling distribution:

  1. unbiasedness means the expectation of the estimator equals the parameter
  2. consistency means that as the sample size increases, the distribution keeps shrinking, getting closer and closer to the parameter

Note: an estimator can be biased but consistent. Imagine a sampling distribution centred to the right of the parameter; as the sample size increases, the peak keeps shifting left until it sits above the parameter

Usually: estimators are biased but consistent

we don’t care about a biased and inconsistent estimator

9. Unbiasedness and consistency: an example

We have a population with mean \(\mu\), but we don’t know its value, so we get a sample.

We calculate an estimator \(\tilde{x}\), defined as

\[ \tilde{x} = \dfrac{1}{N-1}\sum x_i \]

We know that the population has parameter \(\mu\) and each individual observation is \(x_i = \mu + \epsilon_i\). Let's check unbiasedness

\[ E[\tilde{x}] = \dfrac{1}{N-1}\sum E[x_i] = \dfrac{1}{N-1} N\mu = \dfrac{N\mu}{N-1} \neq \mu \]

The estimator is biased, but is it consistent?

Yes, because \(\frac{N}{N-1} \to 1\) as \(N \to \infty\)

\[ N \to \infty,E[\tilde x] \to \mu \]
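A quick simulation of this estimator (hypothetical \(\mu = 10\), NumPy assumed) shows the bias \(N/(N-1)\) shrinking as the sample grows:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 10.0  # hypothetical population mean

# x_tilde divides by N-1 instead of N, so E[x_tilde] = N*mu/(N-1) != mu (biased),
# but N/(N-1) -> 1 as N grows, so the bias vanishes (consistent)
def x_tilde(sample):
    return sample.sum() / (len(sample) - 1)

results = {n: np.mean([x_tilde(rng.normal(mu, 1.0, n)) for _ in range(4000)])
           for n in (5, 50, 5000)}
for n, avg in results.items():
    print(n, round(avg, 2))   # averages approach 10 as n grows
```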

10. Efficiency of estimators

An estimator is efficient if the variance of its sampling distribution is small (aka its not wide)

Two estimators can be unbiased, but one can be more efficient

Note: a biased estimator can be more efficient than an unbiased one

11. Good estimator properties summary

  1. Unbiased \(E(\hat \beta) = \beta^P\)
  2. Consistent \(n \to \infty \qquad \hat \beta \to \beta^P\)
  3. Efficient
  4. Linear: \(\hat \beta\) is a linear function of the observations \(y_i\)

12. Lines of best fit in econometrics

Back to wages by education example. There is a positive relationship between education and wages, so we plot them on a scatter plot

Then we add a line to capture average effect, hence, it must be around the middle of the data representing it.

The slope of the line, \(\beta\), represents the average effect of one more year of education on wages

| educ | wage |
|------|------|
| 10   | 500  |
| 11   | 550  |

The effect here is 50. Why? Because it's a linear relationship

\[ y = mx+c \qquad \text{wages} = \alpha + \beta^p\, \text{educ} \]

Remember, we don’t have the whole data, so we take a sample

the \(\beta^s\) estimates \(\beta^p\)

In the sample, 10 years of education results in $600 on average; however, the true value in the population is $500. The gap is sampling error

Next: we drew the line by eye; we need a mathematical way

13. Math behind drawing lines of best fit

The line in the scatterplot minimizes the sum of squared distances between each observed \(y\) and the fitted line

An alternative way of measuring distance is to take the absolute difference between \(y\) and the fitted \(\hat{y}_i\), which is the point on the line

we take absolute values so points above the line don't cancel the points below the line

\[ S = \sum^N_{i=1}|y_i-\hat{y}_i| \]

But this function is hard to differentiate, instead, we take the square

\[ S = \sum^N_{i=1}(y_i-\hat{y}_i)^2 \]

This is easier to differentiate and focuses more on big deviations: an outlier can shift the whole line towards it because its distance is squared

Why not go to power 4 or more?

because the line would chase the outlier even harder and neglect the remaining points
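A small numeric sketch of this trade-off (the residuals are made up): the higher the power of the loss, the larger the outlier's share of the total loss, so the fitted line bends further towards it.

```python
residuals = [1.0, -2.0, 1.5, 10.0]  # hypothetical residuals; the last is an outlier

for p, name in ((1, "absolute"), (2, "squared"), (4, "fourth power")):
    contrib = [abs(r) ** p for r in residuals]
    share = contrib[-1] / sum(contrib)  # outlier's share of the total loss
    print(f"{name}: outlier contributes {share:.0%} of the loss")
```

Under absolute loss the outlier is one residual among several; under the fourth power it dominates the loss almost entirely.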

14. Least squares Estimators as BLUE

The squared function described is called least square

Back to our wage by educ example, we will use least squares to estimate \(\alpha, \beta\)

we want our estimators derived from least squares to be:

unbiased, consistent, efficient

if the Gauss-Markov assumptions are met, least squares gives the Best Linear Unbiased Estimators (BLUE)

Best means most efficient: the smallest variance among all linear unbiased estimators

Remember that these properties come from the sampling distribution

Next: we use gauss Markov to derive the estimators

15. Deriving least squares estimators 1

We learnt that we could minimize the absolute values, the squares, or higher powers, and we chose the squared function

We will minimize:

\[ \boxed{S = \sum(y_i-\hat{y}_i)^2} \]

minimizing means differentiating and equating to zero

\[ \dfrac{\partial S}{\partial \hat \alpha} = 0 \qquad \dfrac{\partial S}{\partial \hat \beta} = 0 \]

and we know that

\[ \hat{y} = \hat{\alpha} + \hat{\beta}x_i \]

Lets rewrite the function

\[ S = \sum(y_i-\hat{\alpha} - \hat{\beta}x_i)^2 \]

Then differentiate with respect to \(\hat{\alpha}\) and \(\hat{\beta}\) and set each derivative to zero

This gives the first order conditions (FOCs)

\[ \boxed{\dfrac{\partial S}{\partial \hat{\alpha}} = -2 \sum(y_i - \hat\alpha - \hat \beta x_i) = 0} \]

\[ \boxed{\dfrac{\partial S}{\partial \hat{\beta}} = -2 \sum x_i (y_i - \hat\alpha - \hat \beta x_i) = 0} \]

16. Deriving least square estimators 2

To continue, we need two identities

\[ \boxed{\begin{align*} \sum(x_i-\bar x)(y_i - \bar y) &= \sum y_i (x_i - \bar x) \\&= \sum x_i (y_i - \bar y) \end{align*}} \]

and, since the bars are constants, they can be taken outside the sum:

\[ \boxed{\sum \bar x = \bar x \sum 1 = N \bar x} \]

Here is the full proof

\[ \begin{align*} &~~~~\sum(x_i - \bar x)(y_i - \bar y)\\ &= \sum(x_i y_i - x_i \bar y - \bar x y_i + \bar x \bar y)\\ &= \sum x_i y_i - \bar y \sum x_i - \bar x \sum y_i + \bar x \bar y \sum 1\\ &= \sum x_i y_i - N \bar y \bar x - N \bar y \bar x+ N \bar y \bar x\\ &= \sum x_i y_i - N \bar y \bar x\\ &= \sum x_iy_i - \bar x\sum y_i\\ &= \sum y_i(x_i - \bar x) \end{align*} \]
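Both identities are easy to verify numerically; a quick check on random data (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = rng.normal(size=200)
xbar, ybar = x.mean(), y.mean()

# the covariance-style identity holds in both directions
lhs = np.sum((x - xbar) * (y - ybar))
print(np.isclose(lhs, np.sum(y * (x - xbar))))   # True
print(np.isclose(lhs, np.sum(x * (y - ybar))))   # True

# summing a constant: sum of xbar over N terms equals N * xbar
print(np.isclose(np.sum(np.full_like(x, xbar)), len(x) * xbar))   # True
```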

17. Deriving least square estimators 3

Using the equations we proved in 2, we can solve first order conditions in 1.

In the \(\alpha\) FOC, we divide both sides by \(-2\) to simplify, then isolate \(y\)

\[ \begin{align*} \sum (y_i - \hat \alpha - \hat \beta x_i) &= 0\\ \sum y_i &= \hat \alpha \sum 1 + \hat \beta \sum x_i\\ N\bar y &= \hat \alpha N + \hat \beta N \bar x \\ \bar y &= \hat \alpha + \hat \beta \bar x \end{align*} \]

Notice that we know \(\hat y_i = \hat \alpha + \hat \beta x_i\)

Which tells us that the line of best fit from OLS must pass through the point \((\bar x, \bar y)\)

\[ \boxed{\bar y = \hat \alpha + \hat \beta \bar x} \]

18. Deriving least square estimators 4

Continuing with the last equation derived, we can isolate \(\alpha\)

\[ \boxed{\hat \alpha = \bar y - \hat \beta \bar x} \]

Now, into second condition regarding \(\beta\)

\[ \begin{align*} \sum x_i (y_i -\hat \alpha - \hat \beta x_i) &= 0 \\ \sum x_i y_i &= \hat \alpha \sum x_i + \hat \beta \sum x_i^2\\ \sum x_i y_i &= \hat \alpha N \bar x + \hat \beta \sum x_i^2\\ \sum x_i y_i &= (\bar y - \hat \beta \bar x) N \bar x + \hat \beta \sum x_i^2 \\ \sum x_i y_i &= N \bar x \bar y - \hat \beta N \bar x^2 + \hat \beta \sum x_i^2 \end{align*} \]

19. Deriving least square estimators 5

Continuing with last function:

\[ \begin{align*} \sum x_i y_i - N \bar x \bar y &= \hat \beta \left(\sum x_i^2 - N \bar x ^2\right)\\ \hat \beta &= \dfrac{\sum x_i y_i - N \bar x \bar y}{ \sum x_i^2 - N \bar x ^2} \\ \hat \beta &= \dfrac{\sum(x_iy_i - \bar x y_i)}{\sum(x_i^2-\bar xx_i)}\\ \hat \beta &= \dfrac{\sum(x_i - \bar x)(y_i - \bar y)}{\sum(x_i - \bar x)^2} \\ \hat \beta &= \dfrac{cov(x_i,y_i)}{var(x_i)} \end{align*} \]
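The closed-form estimators can be computed directly and checked against a library fit; here is a sketch on simulated data (hypothetical DGP, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
educ = rng.uniform(8, 20, size=500)                    # hypothetical sample
wage = 100 + 50 * educ + rng.normal(0, 40, size=500)   # hypothetical DGP

# beta_hat = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), i.e. cov(x,y)/var(x)
dx = educ - educ.mean()
beta_hat = np.sum(dx * (wage - wage.mean())) / np.sum(dx ** 2)
# alpha_hat = ybar - beta_hat * xbar (the line passes through the means)
alpha_hat = wage.mean() - beta_hat * educ.mean()

# agrees with a library least-squares fit
b_ref, a_ref = np.polyfit(educ, wage, 1)
print(np.isclose(beta_hat, b_ref), np.isclose(alpha_hat, a_ref))   # True True
```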

20. Least squares estimators in summary

we want to estimate the effect of education on wages; to do so, we fit a line of best fit.

Why?

to infer the average effect of education using our sample

We get this line by minimizing the sum of squared distances between \(y\) and the fitted \(\hat y\)

remember that the fitted \(\hat y\) is just \(\hat\alpha + \hat\beta x\)

\(\hat\alpha\) is the intercept, \(\hat\beta\) is the slope

Notice that \(\hat \alpha, \hat \beta\) may not be equal to \(\alpha, \beta\)

Next: Quick probability recap

21. Taking expectations of a random variable

Discrete:

If we have a fair die, a random variable \(x\) represents the value the die will show \(1,2,\dots,6\). \(P(X=x)\) for each value is \(\dfrac 1 6\)

Then \(E(X) = \sum x\,P(X=x)\): the expectation is the average value we would get if we kept rolling the die infinitely. Intuition: the average is in the middle, hence 3.5.

Using formula:

\[ E(X) = \sum x\,P(X=x) = 1 \cdot \tfrac{1}{6} + 2 \cdot \tfrac{1}{6} + \dots + 6 \cdot \tfrac{1}{6} = 3.5 \]

Notice that expectation is just weighted sum
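The die example as a one-liner, weighting each outcome by its probability:

```python
# E(X) for a fair die: weighted sum of the outcomes 1..6, each with probability 1/6
expectation = sum(x * (1 / 6) for x in range(1, 7))
print(round(expectation, 10))   # 3.5
```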

Continuous:

heights of people are never exact values, so we plot them, get the \(PDF\), and integrate

\[ E[X]= \int_{-\infty}^{\infty} x f_X(x)\,dx \]

22. Moments of a random variable

expectation \(E[X]\) is taking the average using sum or integral. And it means the average value as N goes to infinity

We can also get \(E[X^2]\) using law of unconscious statistician

\[ E[X^2]= \int^{\infty}_{-\infty} x^2 f_X(x)\,dx = \lim_{N \to \infty} \dfrac{X_1^2+X_2^2+\dots+X_N^2}{N} \]

This is also called the second moment; the name "moment" is borrowed from physics

We can do this using any power not just 1 and 2

23. Central moments of a random variable

If we have a PDF with all its mass at \(X = 10\), i.e. a constant random variable, its expectation is 10, because on average the value of \(x\) is 10

\(E[X^2]=10^2=100\). This value is not interesting; we didn't learn anything new about the data

Instead, we compute \(E[(X-10)^2]=0\), which tells us how the data deviates from the mean on average (zero, for a constant). This is called the second central moment, the variance. It tells us about the shoulders of the distribution

Notice that \(Var(x) \ge 0\)

Power 4 will tell us about the tails

24. Kurtosis

We can have two distributions that have the same variance and mean. To compare them, we look at the fourth central moment, which focuses on the tails

Example: if expectation = 10, variance = 2

\[ E[(X-10)^4]= \dfrac{(x_1-10)^4+(x_2-10)^4+ \dots + (x_N-10)^4}{N} \]

when anything is raised to the power 4, big values get way bigger, so values in the tails make a huge difference.

To be able to compare values between distributions, we take a ratio, dividing the fourth central moment by the standard deviation to the power 4.

Why power 4?

so both have the same dimension. We then subtract 3, the kurtosis of the normal distribution, to get excess kurtosis

\[ \boxed{\gamma = \dfrac{\mu_4}{\sigma^4}-3} \]
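A quick numeric check (NumPy assumed): for normally distributed data the excess kurtosis should come out near zero.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0, 2, size=200_000)   # normal data: excess kurtosis should be ~0

mu4 = np.mean((x - x.mean()) ** 4)       # fourth central moment
sigma2 = np.mean((x - x.mean()) ** 2)    # variance (second central moment)
excess = mu4 / sigma2 ** 2 - 3           # mu_4 / sigma^4 minus the normal benchmark 3
print(round(excess, 2))                  # close to 0
```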

25. Skewness

A distribution can be symmetric or asymmetric; a positively skewed one has its tail on the right.

To calculate skewness, take the expectation of the cubed deviation from the mean

Why the power 3?

cubing preserves sign: deviations far below \(\mu\) give large negative values, the cubed deviation shrinks to zero at \(x=\mu\), and deviations above \(\mu\) give positive values

If the distribution is not symmetric, the expectation of the cubed deviation will not be 0 (not balanced around zero)

To be able to compare, we divide by the standard deviation cubed (to have the same dimension)

\[ \boxed{\text{skewness} = \dfrac {\mu_3}{{(\sigma^2)}^\frac 3 2}} \]
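The same ratio computed by hand (NumPy assumed), on a symmetric and a right-skewed sample:

```python
import numpy as np

rng = np.random.default_rng(5)

def skewness(x):
    mu3 = np.mean((x - x.mean()) ** 3)   # third central moment
    sigma2 = np.mean((x - x.mean()) ** 2)
    return mu3 / sigma2 ** 1.5           # mu_3 / (sigma^2)^(3/2)

symmetric = rng.normal(size=100_000)
right_tailed = rng.exponential(size=100_000)   # long right tail: positive skew

print(round(skewness(symmetric), 1))   # close to 0
print(skewness(right_tailed) > 1)      # True (the exponential's skewness is 2)
```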

26. Properties of expectation and variance

Multiplication by a constant

when multiplied by a constant, the constant comes out of the integral because it does not vary

\[ \boxed{E[aX]=a \int^{\infty}_{-\infty} x f_X(x)\,dx = aE[X]} \]

for the variance, notice that the \(a\mu_x\) part comes from \(E[aX]\)

\[ \boxed{Var(aX) = E[(aX-a\mu_x)^2]= a^2 Var(X)} \]

why \(a^2\)? because variance is a squared distance

Addition

\[ \boxed{E[aX+bY]= aE[X]+bE[Y]= a\mu_x+b\mu_y} \]

for the variance:

\[ \boxed{Var(aX+bY)= a^2 Var(X)+ b^2 Var(Y)+2ab\,Cov(X,Y)} \]

Here is the full proof

\[ \begin{align*} & Var(aX+bY)\\ &= E[(aX+bY-a \mu_x - b \mu_y)^2]\\ &= E[((aX-a\mu_x)+ (bY- b \mu_y))^2]\\ &= a^2 Var(X)+ b^2 Var(Y)+ 2ab~ Cov(X,Y) \end{align*} \]
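The identity can be verified numerically on simulated correlated variables (NumPy assumed; `ddof=0` makes the sample covariance match `np.var`'s convention):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500_000
x = rng.normal(0, 1, n)
y = 0.5 * x + rng.normal(0, 1, n)   # y is correlated with x
a, b = 2.0, 3.0

lhs = np.var(a * x + b * y)
rhs = a**2 * np.var(x) + b**2 * np.var(y) + 2 * a * b * np.cov(x, y, ddof=0)[0, 1]
print(np.isclose(lhs, rhs))   # True
```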

27. Covariance/correlation

Covariance:

if we have two variables with many values plotted on a scatterplot, a positive relationship means that as \(x\) increases, \(y\) increases.

Negative relationship: \(x\) increases, \(y\) decreases

To represent this relationship, we have the covariance

\[ \boxed{Cov(X,Y) = E[(X-\mu_x)(Y-\mu_y)]} \]

Intuition: if \(x\) is below its mean when \(y\) is below its mean (and above when above), the covariance is \(+\); if they move in opposite directions, it is \(-\)

| Relation bet. \(x,y\) | Position of \(x\) | \(x-\mu_x\) | \(y-\mu_y\) | Product |
|---|---|---|---|---|
| \(+\) | \(x>\mu_x\) | \(+\) | \(+\) | \(+\) |
| \(+\) | \(x<\mu_x\) | \(-\) | \(-\) | \(+\) |
| \(-\) | \(x>\mu_x\) | \(+\) | \(-\) | \(-\) |
| \(-\) | \(x<\mu_x\) | \(-\) | \(+\) | \(-\) |

However, covariance is affected by units, so we use correlation

correlation:

divide the covariance by the standard deviation of each variable to remove the units. The covariance can't be bigger (in absolute value) than the product of the standard deviations, so correlation is bounded between \(-1\) and \(1\)
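A sketch of the units point (hypothetical height/weight data, NumPy assumed): rescaling metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import numpy as np

rng = np.random.default_rng(7)
height_m = rng.normal(1.7, 0.1, size=1000)                         # metres
weight = 60 + 40 * (height_m - 1.7) + rng.normal(0, 5, size=1000)
height_cm = height_m * 100                                         # same data, new units

# covariance scales with the units...
ratio = np.cov(height_cm, weight)[0, 1] / np.cov(height_m, weight)[0, 1]
print(round(ratio))   # 100

# ...but correlation does not, and stays within [-1, 1]
r_m = np.corrcoef(height_m, weight)[0, 1]
r_cm = np.corrcoef(height_cm, weight)[0, 1]
print(np.isclose(r_m, r_cm), -1 <= r_m <= 1)   # True True
```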

28. Population vs sample quantities

In the previous sections, we were talking about population quantities like \(E[X], Var(X)\), not sample quantities; meaning: we know the PDF (have the whole dataset)

In a sample, we have \(\bar x\) to estimate \(\mu\) and \(S^2\) to estimate \(\sigma^2\) where

\[ \boxed{\bar X = \dfrac1 N \sum x_i} \]

\[ \boxed{S^2 = \dfrac 1{N-1} \sum(x_i - \bar X)^2} \]
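These two sample formulas, computed by hand and checked against NumPy's built-ins (note `ddof=1` for the \(N-1\) divisor):

```python
import numpy as np

rng = np.random.default_rng(8)
sample = rng.normal(5, 2, size=10_000)   # hypothetical sample, true mu=5, sigma=2

xbar = sample.sum() / len(sample)                       # estimates mu
s2 = np.sum((sample - xbar) ** 2) / (len(sample) - 1)   # estimates sigma^2

print(np.isclose(xbar, np.mean(sample)))        # True
print(np.isclose(s2, np.var(sample, ddof=1)))   # True
```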

29. Population regression function

The population regression function is also called the population process. Wage is a function of education; we can plot it on a scatterplot

As education increases, the options for wage widen (one can work as a plumber, an academic, or a banker), so we fit a line. Wages are not perfectly a function of education; the other factors are thrown into the error or disturbance term \(u_i\)

\[ \text{wage}_i = \alpha+ \beta\, \text{educ}_i + u_i \]

Example: a person with 15 years of education can be a rich banker or a poor academic

The errors are distributed around the fitted line with most errors close to it and few far away (though we don't assume normality)

But we expect error term to be

\[ \boxed{u_i \sim iid(0, \sigma^2)} \]

independent: knowing one error tells us nothing about another

identically distributed: they come from the same underlying process (so the width of the scatter does not change along the line)

Notice: error term has mean of zero, so it vanishes when taking expectations

\[ E[\text{wages}|\text{educ}] = \alpha + \beta\, \text{educ} \]

30. Problem set 1

It covers the theory of estimators, plus some practical questions; if you don't know programming, use GRETL

practical example:

relationship between crime and unemployment across US states

Theoretical: theory of estimators and some mathematics

https://ben-lambert.com/econometrics-course-problem-sets-and-data/

problem-set-1-final.pdf

problemset1.csv

problem-set-1-answers-final1.pdf