(9/4/22) Imagine one sample of observations is drawn. Each observation in the sample has 2 characteristics that are measured, X and Y. So we effectively have a sample of random variable X and a sample of random variable Y. Note, by the way, that this is where endogeneity comes from. If X_i and Y_i are both characteristics of the same observation, then they are likely to be correlated through other variables that are not captured. Random assignment of X, in a way that is not correlated with those other determinants of Y, alleviates this problem. A scatter plot shows you all the data for X and Y, laid out as the black dots in the example above. What is the line of best fit, and what does it have to do with beta, the regression coefficient, which equals cov(X,Y)/var(X)?


People often say that beta is the slope of the line of best fit. But a slope is m = (Y_1 - Y_0)/(X_1 - X_0), while beta = cov(X,Y)/var(X). In other words, we were taught that slope is “rise over run,” so how is beta the slope of the line of best fit if it equals (1/n)*Sum((X_i - mu_x)(Y_i - mu_y))/var(X)? Rise over run is measured from one point on the line relative to another. So the slope of the line is calculated by first assuming that there is a line of best fit that runs through the data. Once this assumption is made, we use a completely separate fact: the mean minimizes the sum of squared errors. We set up the sum of squared errors optimization problem but subtract out what we call the "conditional mean" rather than the mean. This "conditional mean" is what we have assumed is the line of best fit. It is the mean value of Y after we have assumed that Y is a linear function of X, and we have "conditioned" on X by plugging all X values into the line. Then we take the partial derivative of the sum of squared errors, with the conditional mean embedded in it, with respect to the slope coefficient, and set it to zero. This chooses the slope of the line that minimizes squared errors around the line.

It just so happens that beta fits the beta = cov(X,Y)/var(X) framework because we chose to minimize squared errors. Do squared errors sound familiar? They are the definition of variance: average squared errors around the mean. We have also assumed that the relationship between X and Y is linear. The mean of the data is also linear (it is the sum Y_1 + Y_2 + ... + Y_n, for example, with each term multiplied by the constant 1/n). These assumptions together mean that the slope coefficient that minimizes squared errors is a ratio of sums involving the Xs and the Ys. We then multiply by (1/n)/(1/n), which is the same as multiplying by 1, in order to express the argmin in terms of the means of X and Y. This is what allows us to express beta as cov(X,Y)/var(X). So this analogy between the slope of a line and covariance and variance is by design. The optimization problem (least squared errors around the mean rather than absolute errors, with the mean itself replaced by a function that maintains the affine structure of the mean) was chosen so that we could express the results in terms of means and variances. This is fitting, since after all we are dealing with random variables.
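The steps described above can be written out as a short derivation. This is a sketch in standard notation, not notation taken from these notes: \bar{X} and \bar{Y} stand for the sample means (mu_x and mu_y in the text), and alpha is the intercept of the assumed line.

```latex
\begin{aligned}
\min_{\alpha,\beta}\; & \sum_{i=1}^{n} (Y_i - \alpha - \beta X_i)^2 \\
\frac{\partial}{\partial \alpha} = 0: \quad
  & \sum_{i=1}^{n} (Y_i - \alpha - \beta X_i) = 0
  \;\Rightarrow\; \alpha = \bar{Y} - \beta \bar{X} \\
\frac{\partial}{\partial \beta} = 0: \quad
  & \sum_{i=1}^{n} X_i (Y_i - \alpha - \beta X_i) = 0 \\
\text{substitute } \alpha: \quad
  & \sum_{i=1}^{n} X_i \big[ (Y_i - \bar{Y}) - \beta (X_i - \bar{X}) \big] = 0 \\
\Rightarrow \quad
  & \beta
  = \frac{\sum_{i} X_i (Y_i - \bar{Y})}{\sum_{i} X_i (X_i - \bar{X})}
  = \frac{\sum_{i} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i} (X_i - \bar{X})^2}
  = \frac{\tfrac{1}{n}\sum_{i} (X_i - \bar{X})(Y_i - \bar{Y})}{\tfrac{1}{n}\sum_{i} (X_i - \bar{X})^2}
  = \frac{\operatorname{cov}(X,Y)}{\operatorname{var}(X)}
\end{aligned}
```

The second equality in the last line uses the fact that deviations from a mean sum to zero, so subtracting \bar{X} inside the sums changes nothing. The third equality is the multiplication by (1/n)/(1/n).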


Additionally, the line of best fit is the line that minimizes the mean squared error (MSE). But this still does not tell us how we get the line. Stata does not run a computational search over candidate lines to find it. If it did, it would have to propose a line to start with, compute the squared errors from each point to that line, and then search over lines for the one with the smallest MSE. Instead, it uses the closed-form solution that comes out of the optimization problem.
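The contrast between searching and solving can be sketched in a few lines of Python. This is only an illustration, not a description of how Stata is implemented. The grid bounds and step size are arbitrary assumptions, and the data are the four observations used later in these notes.

```python
import numpy as np

# Four-observation example dataset from these notes.
x = np.array([2.0, 3.0, 5.0, 6.0])
y = np.array([1.0, 3.0, 2.0, 4.0])

# Closed form: slope = cov(X, Y) / var(X), intercept = mean(Y) - slope * mean(X).
beta_closed = np.mean((x - x.mean()) * (y - y.mean())) / np.mean((x - x.mean()) ** 2)
alpha_closed = y.mean() - beta_closed * x.mean()


def mse(alpha, beta):
    """Mean squared error of the candidate line alpha + beta * x."""
    return np.mean((y - alpha - beta * x) ** 2)


# Brute-force alternative: evaluate a grid of candidate lines and keep the best.
# The grid (from -2 to 2 in steps of 0.05) is an arbitrary choice for illustration.
alphas = np.linspace(-2.0, 2.0, 81)
betas = np.linspace(-2.0, 2.0, 81)
_, alpha_grid, beta_grid = min((mse(a, b), a, b) for a in alphas for b in betas)

print("closed form:", alpha_closed, beta_closed)
print("grid search:", alpha_grid, beta_grid)
```

The grid search needs thousands of MSE evaluations and is only as precise as the grid, while the closed form needs one pass over the data.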


We sum over all of the X_i's and all of the Y_i's, not just two points, because we want to use all of the data in the sample to have an efficient estimator. In other words, each point shown in the scatter plot is taken into consideration. Beta is the slope of the line of best fit. The reason that it "fits" all of the data points is that we take every observation into consideration. You divide by n to calculate the mean, which SW Chapt 2 tells you is the most efficient estimator of the population mean. So now that we know we want to use all the information in the data, what if you just summed all of the X's and divided by the number of X's? That would be a linear (affine) combination of the X's. The result is shown by the light blue vertical line at mu_x.


The question is do X and Y co-move together? The answer is: Let’s see. Let’s start with one variable at a time. X. We can calculate how X varies from its mean by calculating its deviations from the mean. Imagine you have the dataset:

Observation   X   Y
1             2   1
2             3   3
3             5   2
4             6   4


The mean of X is (2 + 3 + 5 + 6)/4 = 4. Now the same thing for Y: Y's mean is (1 + 3 + 2 + 4)/4 = 2.5. We can then plot X's mean, mu_x, on the scatter plot in light blue, and Y's mean, mu_y, on the scatter plot in red. The deviations of each X_i and Y_i from their means are then marked in light blue and red.
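The same numbers can be reproduced with a few lines of Python. This is a minimal sketch; the variable names are chosen for illustration only.

```python
import numpy as np

# The four observations from the table above.
x = np.array([2.0, 3.0, 5.0, 6.0])
y = np.array([1.0, 3.0, 2.0, 4.0])

mu_x = x.mean()   # (2 + 3 + 5 + 6) / 4 = 4.0
mu_y = y.mean()   # (1 + 3 + 2 + 4) / 4 = 2.5

dev_x = x - mu_x  # deviations of X from its mean: -2, -1, 1, 2
dev_y = y - mu_y  # deviations of Y from its mean: -1.5, 0.5, -0.5, 1.5

print(mu_x, mu_y)
print(dev_x, dev_y)
```

Note that each set of deviations sums to zero, which is a general property of deviations from a mean.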

What the covariance is doing is taking the product of these deviations from X's mean and Y's mean. The product of these is shown in dark blue. This tells us how Y_i's deviation from its mean moves with X_i's deviation from its mean. The covariance divides the sum of these products by the sample size to find the average product of X's and Y's deviations from their means. When we divide by the variance of X, we rescale the result so that it is expressed per unit of X. So the regression coefficient is telling us how many units Y changes for each one-unit change in X, relative to the means. (Note, X and Y have already been demeaned in the covariance calculation.) Covariance measures the linear relationship between 2 variables. It does not measure non-linear relationships (SW Chapt 3.7). This is because the covariance is essentially the dot product between X's and Y's deviations from their means. This is the basis for the linear model. The reason we are asked to suppose there is a line that fits the data, and to start there, is that the standard tools we have, such as covariances and means, are linear in nature.
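Continuing with the four-observation dataset, the calculation can be sketched as follows. The cross-check against `np.polyfit` is an addition for illustration, not something these notes rely on.

```python
import numpy as np

x = np.array([2.0, 3.0, 5.0, 6.0])
y = np.array([1.0, 3.0, 2.0, 4.0])

# Products of the deviations from the means: 3, -0.5, -0.5, 3.
products = (x - x.mean()) * (y - y.mean())

cov_xy = products.mean()                # average product: 5 / 4 = 1.25
var_x = ((x - x.mean()) ** 2).mean()    # average squared deviation: 10 / 4 = 2.5
beta = cov_xy / var_x                   # 1.25 / 2.5 = 0.5

# Dividing by n - 1 instead of n changes cov and var but not their ratio.
beta_ddof1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Cross-check: a degree-1 least squares fit returns (slope, intercept).
slope, intercept = np.polyfit(x, y, 1)

print(cov_xy, var_x, beta, beta_ddof1, slope)
```

The covariance is also literally a dot product here: `np.dot(x - x.mean(), y - y.mean()) / len(x)` gives the same 1.25.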

The intercept of the line of best fit is built from the mean of Y. The line is telling us what Y is equal to for a given X. For this, we put the mean of Y back in via the intercept, alpha. We set alpha equal to mu_y - beta*mu_x, and we use the slope to tell us how to vary Y for a unit change in X. This assumes that X and Y are linearly related, so that Y changes by the same amount for every unit change in X. This assumption of linearity comes from using the covariance as the calculator, because it takes a linear combination of the movement between X and Y for every X and Y in the sample.

Since the line of best fit minimizes the deviations of the observations from their means, it makes sense that the line of best fit will pass through the mean of both X and Y, since at the mean deviations from the mean equal zero. Is this always the case? How does this map to the linear model/OLS? Would linear algebra help us understand?
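For the first question, a quick numerical check on the four-observation dataset is below. It is a sketch, not a proof. The algebra behind it is that with alpha = mu_y - beta*mu_x, plugging X = mu_x into the line gives alpha + beta*mu_x = mu_y, so the result holds whenever the intercept is chosen this way.

```python
import numpy as np

x = np.array([2.0, 3.0, 5.0, 6.0])
y = np.array([1.0, 3.0, 2.0, 4.0])

beta = np.mean((x - x.mean()) * (y - y.mean())) / np.mean((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()   # intercept: mu_y - beta * mu_x

# Value of the fitted line at X = mu_x; it should equal mu_y.
fitted_at_mean = alpha + beta * x.mean()

# Residuals around the fitted line; they should sum to zero.
residuals = y - (alpha + beta * x)

print(fitted_at_mean, y.mean())
print(residuals, residuals.sum())
```

Here the residuals are -0.5, 1, -1, 0.5, which sum to zero up to floating-point error.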


(8/28/22) The degrees of freedom is most often the sample size.

What we do in economics is take samples of a population. Examine that data. Get point estimates. Then calculate confidence intervals to know whether our point estimates are different from zero. Imagine our sample is size m. And each of the random variables (r.v.s) in that sample is iid normal. Then the estimate, and statistics based on the estimate, become random variables because we could resample m observations from the population multiple times. To calculate confidence intervals and evaluate estimates, we need to understand the distribution of the estimates.

The size of the sample that is resampled is important for the distribution of the estimator. It is important because it is the degrees of freedom in the Chi-squared and Student t distributions, for example. To see this, let's consider degrees of freedom in the context of the Student t distribution. The Student t distribution comes in handy for hypothesis testing. As background, for hypothesis testing we usually use the t-statistic and compare it to the critical values of a standard normal distribution. To calculate the t-stat, you first collect a sample of data, and then you calculate an estimator using the sample, let's say beta. To understand when to reject the null hypothesis that beta equals 0, we need to know how the t-stat is distributed so that we can compare the calculated t-stat to the critical values of its distribution.

Under the null hypothesis, the distribution that the standardized estimator takes on, if it is calculated over and over again on new samples, is approximately standard normal when the sample size is large, by the central limit theorem. The t-statistic standardizes an estimator. To calculate it, take beta, subtract its value under the null, and divide by the standard error, which estimates the standard deviation of the estimator across many samples. We then compare the t-stat to the critical value of the standard normal distribution at the level of significance at which we would like to test the null hypothesis, in this case beta = 0.
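As a minimal sketch of this recipe, the example below uses the sample mean as the estimator and a simulated sample. The function name `t_statistic`, the sample size of 500, and the data-generating values are all hypothetical choices for illustration. The value 1.96 is the two-sided 5% critical value of the standard normal.

```python
import numpy as np


def t_statistic(estimate, null_value, standard_error):
    """Standardize an estimate: (estimate - value under the null) / standard error."""
    return (estimate - null_value) / standard_error


# Hypothetical large sample whose true mean is 0.3, so the null of 0 is false.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.3, scale=1.0, size=500)

estimate = sample.mean()
# Standard error of the sample mean: sample standard deviation / sqrt(n).
standard_error = sample.std(ddof=1) / np.sqrt(sample.size)

t_stat = t_statistic(estimate, 0.0, standard_error)
reject_null = abs(t_stat) > 1.96   # two-sided test at the 5% level

print(t_stat, reject_null)
```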

However, when the sample size is small, the distribution of the standardized estimator under the null is not standard normal. This means the t-stat cannot be evaluated against the critical values for the standard normal. But if you are testing exactly one population, and you believe the population is normally distributed, then luckily the t-statistic takes on the Student t distribution. This means the t-stat is still informative, because we can evaluate it against the Student t distribution's critical values.

The t-stat takes on the Student t distribution because the null hypothesis is still that the estimate = 0. Therefore, the numerator is the estimator, which is calculated on a sample, minus 0. You should think of the sample being taken many times from the population ~N(mu, sigma^2). Even if the sample size is m=10, you draw the sample, calculate the estimator, draw the sample, calculate the estimator, draw the sample, calculate the estimator, over and over again. Since the underlying population is normal and the estimator is a linear combination of the observations, the distribution of all of the estimates calculated is normal, even for small m. No large-sample argument such as the central limit theorem is needed here.
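The repeated-sampling thought experiment can be simulated directly. The sketch below assumes the estimator is the sample mean and picks mu = 0, sigma = 1, and 100,000 redraws arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 0.0, 1.0      # assumed population parameters
m = 10                    # sample size
n_draws = 100_000         # number of times the sample is redrawn

# Each row is one sample of size m drawn from N(mu, sigma^2).
samples = rng.normal(mu, sigma, size=(n_draws, m))

# One estimate (the sample mean) per redrawn sample.
estimates = samples.mean(axis=1)

# The estimates should center on mu with variance sigma^2 / m.
mean_of_estimates = estimates.mean()
var_of_estimates = estimates.var()

# If the estimates are normal, about 95% lie within 1.96 standard deviations of mu.
sd_of_estimator = sigma / np.sqrt(m)
share_within = np.mean(np.abs(estimates - mu) < 1.96 * sd_of_estimator)

print(mean_of_estimates, var_of_estimates, share_within)
```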

The denominator of the t-stat is the standard error of the estimate. The standard error is the standard deviation of the estimates calculated if the sample is redrawn and the estimate is recalculated multiple times. Since it is the standard deviation, it is the square root of the variance of the estimates. How is the variance of the estimates, calculated many times, distributed? To answer that, let's think about what the variance is. The variance of the estimates is calculated as the sum of the variances of each observation in a sample that is redrawn multiple times. For example, think of a sample of size m=10 that is made up of 10 different observations; let's call them x_1, x_2, x_3, ..., x_10. Each observation in the sample is itself a random variable, because it changes across each sample taken. Therefore, the variance of the sample of size 10 taken repeatedly is the variance of the sum of 10 independent random variables x_1, x_2, x_3, ..., x_10. They are independent because we assume each observation is independent and identically distributed (iid). The variance of a standardized random variable is E[(x-µ)^2] = E[(x-0)^2] = E[x^2].


Var(x_1+x_2+x_3+ ...+x_10) 

= Var(x_1)+Var(x_2)+Var(x_3)+...+Var(x_10) 

= [p_1(x_1a)^2+p_2(x_1b)^2+p_3(x_1c)^2+...+p_infinity(x_1infinity)^2]+[p_1(x_2a)^2+p_2(x_2b)^2+p_3(x_2c)^2+...+p_infinity(x_2infinity)^2]+...+[p_1(x_10a)^2+p_2(x_10b)^2+p_3(x_10c)^2+...+p_infinity(x_10infinity)^2].

To see this, think of each observation in a sample of size 10 as a placeholder. Placeholder x_i can take on infinitely many values if the population is infinitely sized and the sample is redrawn enough times. Then each of these squared terms is distributed as a squared standard normal. Moreover, in the variance calculation, since the observations are all assumed to be iid, each is equally likely, so the variance estimator weights them equally rather than having a probability distribution over x_1,...,x_10. Therefore the square of the denominator of the t-stat is distributed as the sum of 10 squared standard normal random variables divided by 10. In other words, it is chi-squared with m degrees of freedom divided by m.

This means that the t-stat is a random variable Y = Z/(W/m)^(1/2) distributed Student t with m=10 degrees of freedom, where Z is distributed standard normal, W is distributed Chi-squared with m = 10 degrees of freedom, and Z and W are independent.
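The construction Y = Z/(W/m)^(1/2) can be simulated to see why small samples need their own critical values. This sketch builds W as a sum of m squared standard normals. The seed and number of draws are arbitrary, and 1.96 is the two-sided 5% critical value of the standard normal.

```python
import numpy as np

rng = np.random.default_rng(7)
m = 10               # degrees of freedom
n_draws = 200_000    # number of simulated t-stats

# Numerator: a standard normal.
z = rng.standard_normal(n_draws)

# W: the sum of m independent squared standard normals, i.e. chi-squared with m df.
w = (rng.standard_normal((n_draws, m)) ** 2).sum(axis=1)

# The ratio from the text: Y = Z / sqrt(W / m).
t_draws = z / np.sqrt(w / m)

# How often does each variable exceed the standard normal critical value?
share_normal = np.mean(np.abs(z) > 1.96)        # close to 0.05 by construction
share_t = np.mean(np.abs(t_draws) > 1.96)       # noticeably larger than 0.05

print(w.mean(), share_normal, share_t)
```

The t draws land beyond 1.96 more often than the normal draws do, so comparing a small-sample t-stat to 1.96 would reject a true null too often.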

Consistent with the interpretation of degrees of freedom being the sample size, when SW 3rd Edition p. 66 talks about efficient estimators, it says Y^bar is more efficient than Y_1. This is because Y^bar uses all of the "information." What the book means by information is that Y^bar uses all of the m observations collected in the sample. Conversely, Y_1 only uses the first observation from each sample drawn. SW p. 74 says that when we calculate the sample variance, we divide the sum of squared deviations from Y^bar by (m-1) rather than m. This corrects for the slight downward bias that comes from measuring deviations around Y^bar, the sample mean, rather than around the true population mean, mu_y. It calls this a degrees of freedom correction.
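Both claims can be checked with a small simulation. This is a sketch under assumed values: a normal population with mu_y = 5 and sigma = 2, samples of size m = 10, and 100,000 redraws.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_y, sigma = 5.0, 2.0    # assumed population mean and standard deviation
m = 10                    # sample size
n_draws = 100_000         # number of times the sample is redrawn

samples = rng.normal(mu_y, sigma, size=(n_draws, m))

# Efficiency: Y_bar uses all m observations, Y_1 uses only the first.
y_bar = samples.mean(axis=1)
y_1 = samples[:, 0]
var_y_bar = y_bar.var()   # close to sigma^2 / m = 0.4
var_y_1 = y_1.var()       # close to sigma^2 = 4.0

# Degrees of freedom correction: average sample variance across redraws.
avg_var_divide_m = samples.var(axis=1, ddof=0).mean()    # close to (m-1)/m * 4 = 3.6
avg_var_divide_m1 = samples.var(axis=1, ddof=1).mean()   # close to 4.0

print(var_y_bar, var_y_1)
print(avg_var_divide_m, avg_var_divide_m1)
```

Both estimators are centered on mu_y, but Y^bar varies far less across samples. Dividing by m understates the population variance on average, while dividing by (m-1) does not.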

Degrees of Freedom also make sense in the Linear Algebra sense where a degree of freedom is a row of a matrix. If we put each observation of a randomly drawn sample into a matrix, one observation would correspond to one degree of freedom.

*Please note that degrees of freedom is not always the sample size. Its meaning depends on what the estimator is.  For the sample mean in the example above, degrees of freedom is the sample size. For the Sargan-Hansen test for example, the degrees of freedom is not the sample size, it is the number of overidentifying restrictions (#instruments - #endogenous regressors). Thank you to Michael Gmeiner for helpful discussions on the Sargan-Hansen test.