HOW IS BETA A SLOPE? IT'S NOT WRITTEN AS "RISE OVER RUN"
(9/4/22) Imagine one sample of observations drawn. Each observation in the sample has two measured characteristics, X and Y, so we effectively have a sample of random variable X and a sample of random variable Y. This, by the way, is where endogeneity comes from: if X_i and Y_i are both characteristics of the same observation, then they are likely to be correlated along other variables not captured. Random assignment of X in a way that is not correlated with Y alleviates this problem. A scatter plot shows you all the data for X and Y, laid out as the black dots in the example above. What is the line of best fit, and what does it have to do with beta, the regression coefficient cov(X,Y)/var(X)?
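A minimal simulation sketch in NumPy can make the connection concrete: the slope of the least-squares line of best fit is numerically identical to cov(X,Y)/var(X). The true slope of 2 and the seed are illustrative assumptions, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate one sample: X drawn at random, Y linearly related to X plus noise.
# The true slope of 2.0 is an illustrative assumption.
n = 1_000
X = rng.normal(size=n)
Y = 2.0 * X + rng.normal(size=n)

# Slope of the least-squares line of best fit...
slope, intercept = np.polyfit(X, Y, deg=1)

# ...equals the regression coefficient cov(X, Y) / var(X).
beta = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)

print(slope, beta)  # the two numbers agree
```

The `ddof=1` in `np.var` matters only for matching `np.cov`'s convention; the sample-size factors cancel in the ratio, so beta is the same either way.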
DEGREES OF FREEDOM IS SAMPLE SIZE
(8/28/22) The degrees of freedom is most often the sample size.
What we do in economics is take samples of a population. Examine that data. Get point estimates. Then calculate confidence intervals to know whether our point estimates are different from zero. Imagine our sample is size m. And each of the random variables (r.v.s) in that sample is iid normal. Then the estimate, and statistics based on the estimate, become random variables because we could resample m observations from the population multiple times. To calculate confidence intervals and evaluate estimates, we need to understand the distribution of the estimates.
The size of the sample that is resampled is important for the distribution of the estimator. It is important because it is the degrees of freedom in the Chi-squared and Student-t distributions, for example. To see this, let's consider degrees of freedom in the context of the Student t distribution. The Student t distribution comes in handy for hypothesis testing. As background, for hypothesis testing we usually use the t-statistic and compare it to the critical values of a standard normal distribution. To calculate the t-stat, you first collect a sample of data and then calculate an estimator from the sample, let's say beta. To decide whether to reject the null hypothesis that beta equals 0, we need to know how the t-stat is distributed, so that we can compare the calculated t-stat to the critical values of its distribution.
Under the null hypothesis, the distribution that the estimator takes on, if it is calculated over and over again, is standard normal when the sample size is large, by the central limit theorem. The t-statistic standardizes an estimator. To calculate it, take beta, subtract the mean under the null, and divide by the standard error, which is the standard deviation of the estimator calculated over many samples. We then compare the t-stat to the critical value of the standard normal distribution at the level of significance at which we would like to reject the null hypothesis, in this case beta = 0.
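The recipe above can be sketched in a few lines of NumPy, using the sample mean as the estimator. The population parameters, sample size, and the 1.96 cutoff (the 5% two-sided standard-normal critical value) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample; the estimator here is the sample mean.
sample = rng.normal(loc=1.0, scale=2.0, size=200)

beta_hat = sample.mean()                         # the estimate
se = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of the mean

# t-stat: (estimate - value under the null) / standard error.
t_stat = (beta_hat - 0.0) / se

# Compare to the standard-normal critical value for a 5% two-sided test.
reject = abs(t_stat) > 1.96
```

With 200 observations the large-sample (standard normal) critical value is reasonable; the small-sample case is exactly where the Student t discussion below takes over.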
However, when the sample size is small, the distribution of the estimator under the null is not standard normal. This means the t-stat cannot be evaluated against the critical values of the standard normal. But if you are testing exactly one population, and you believe the population is normally distributed, then luckily the t-statistic takes on the Student t-distribution. This means the t-stat is still informative, because we can evaluate it against the Student t-distribution critical values.
The t-stat takes on Student t because the null hypothesis is still that the estimate = 0. Therefore, the numerator is the distribution of the estimator, which is calculated on a sample, minus 0. You should think of the sample being taken many times from a population ~N(mu, sigma^2). Even if the sample size is m = 10, you draw the sample, calculate the estimator, draw the sample, calculate the estimator, over and over again. Since the underlying population is normal, the distribution of all of the estimates calculated will itself be normal — exactly normal, in fact, with no large-sample approximation needed. (For a non-normal population, the central limit theorem would only deliver approximate normality, and only when the sample size is large.)
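A quick resampling simulation illustrates this: draw many samples of size m = 10 from a normal population, compute the mean of each, and the estimates come out normal with mean mu and standard deviation sigma/sqrt(m). The population parameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

mu, sigma, m = 3.0, 1.5, 10   # assumed population parameters, small sample size
n_resamples = 100_000

# Draw a sample of size m, calculate the estimator (the mean), and repeat.
estimates = rng.normal(mu, sigma, size=(n_resamples, m)).mean(axis=1)

# Because the population is normal, the estimates are normal with
# mean mu and standard deviation sigma / sqrt(m), even though m is only 10.
print(estimates.mean(), estimates.std())
```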
The denominator of the t-stat is the standard error of the estimate. The standard error is the standard deviation of the estimates calculated if the sample is redrawn and the estimate is recalculated multiple times. Since it is a standard deviation, it is the square root of a variance. How is that variance, estimated from samples drawn many times, distributed? To answer that, let's think about what the variance estimate is: it is built up from the contribution of each observation in a sample that is redrawn multiple times. For example, think of a sample of size m = 10 made up of 10 different observations, call them x_1, x_2, x_3, ..., x_10. Each observation in the sample is itself a random variable, because it changes across each sample taken. Therefore the variance of the repeatedly drawn sample of size 10 involves 10 independent random variables x_1, x_2, x_3, ..., x_10 — independent because we assume each observation is independent and identically distributed (iid). The variance of a standardized random variable is E(x − µ)^2 = E(x − 0)^2 = E(x^2).
To see this, think of each observation in the sample of size 10 as a placeholder. Placeholder x_i can take on infinitely many values if the population is infinitely sized and the sample is redrawn enough times. The distribution of each squared standardized observation is then a squared normal distribution, i.e. chi-squared with 1 degree of freedom. Moreover, since the observations are all assumed to be iid, each is equally likely, so the variance estimator weights them equally rather than placing a probability distribution over x_1, ..., x_10. Therefore the denominator of the t-stat is distributed as the sum of 10 squared normal distributions divided by 10 — in other words, chi-squared with m degrees of freedom divided by m.
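The chi-squared claim is easy to check by simulation: summing m squared standard normals gives a random variable whose mean is m and whose variance is 2m, the moments of a chi-squared with m degrees of freedom. The resample count below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(3)

m = 10
n_resamples = 200_000

# Sum of m squared standard normals: chi-squared with m degrees of freedom.
z = rng.normal(size=(n_resamples, m))
w = (z ** 2).sum(axis=1)

# A chi-squared with m degrees of freedom has mean m and variance 2m.
print(w.mean(), w.var())
```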
This means that the t-stat is a random variable Y = Z/(W/m)^(1/2) distributed Student-t with m = 10 degrees of freedom, with Z distributed standard normal and W distributed chi-squared with m = 10 degrees of freedom. (Strictly, because the sample mean is itself estimated, the sample variance yields a chi-squared with m − 1 degrees of freedom — this is the degrees-of-freedom correction noted below.)
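The construction Y = Z/(W/m)^(1/2) can be verified directly: build Y from its standard-normal and chi-squared ingredients, and compare it to draws from NumPy's built-in Student-t with m degrees of freedom. The sample counts and the choice to compare 95th percentiles are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

m = 10
n = 200_000

# Build Y = Z / sqrt(W / m) directly from its ingredients.
z = rng.normal(size=n)                           # standard normal numerator
w = (rng.normal(size=(n, m)) ** 2).sum(axis=1)   # chi-squared(m) denominator
y = z / np.sqrt(w / m)

# Compare to draws from NumPy's Student-t with m degrees of freedom.
t_draws = rng.standard_t(df=m, size=n)

# The two distributions match, e.g. at the 95th percentile.
print(np.percentile(y, 95), np.percentile(t_draws, 95))
```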
Consistent with the interpretation of degrees of freedom as the sample size, when SW 3rd Edition p. 66 discusses efficient estimators, it says Y^bar is more efficient than Y_1. This is because Y^bar uses all of the "information": by information, the book means that Y^bar uses all m observations collected in the sample, whereas Y_1 uses only the first observation from each sample drawn. SW p. 74 says that when estimating the variance, we divide the sum of squared deviations from Y^bar by (m − 1). This corrects for the slight downward bias that comes from measuring deviations around Y^bar, the sample mean, rather than around the true population mean, mu_Y. It calls this a degrees of freedom correction.
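The efficiency comparison is simple to demonstrate: resample many times and compare the variance of Y^bar (all m observations) to the variance of Y_1 (the first observation only). The population parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

m = 10
n_resamples = 100_000
samples = rng.normal(loc=0.0, scale=1.0, size=(n_resamples, m))

# Two estimators of the population mean:
y_bar = samples.mean(axis=1)   # uses all m observations ("all the information")
y_1 = samples[:, 0]            # uses only the first observation of each sample

# Both are unbiased, but Y_bar has variance sigma^2/m versus sigma^2 for Y_1.
print(y_bar.var(), y_1.var())
```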
Degrees of freedom also makes sense in the linear-algebra sense: if we put each observation of a randomly drawn sample into a matrix as a row, then each observation corresponds to one degree of freedom.
*Please note that degrees of freedom is not always the sample size. Its meaning depends on what the estimator is. For the sample mean in the example above, the degrees of freedom is the sample size. For the Sargan-Hansen test, by contrast, the degrees of freedom is not the sample size but the number of overidentifying restrictions (#instruments − #endogenous regressors). Thank you to Michael Gmeiner for helpful discussions on the Sargan-Hansen test.