Estimating regression coefficients

The total length of the videos in this section is approximately 75 minutes. Feel free to do this in multiple sittings. You will also spend time answering short questions while completing this section.

You can also view all the videos in this section at the YouTube playlist linked here.

EstimatingRegressionCoefficients.1.EstimatingResidualVariance.mp4

Question 1: If the estimated slope is 2, the sample mean of X is 1, and the sample mean of Y is 5, what is the estimated intercept of the regression line?

Show answer

The intercept is Ybar - Xbar*Slope = 5-1*2 = 3.
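If you want to check this relationship numerically, here is a minimal sketch in Python. The data are made up for illustration, and using NumPy's `polyfit` as the least-squares fitter is my assumption, not something taken from the video.

```python
import numpy as np

# Made-up data, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Least-squares line; polyfit returns the highest power first: [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

# The fitted line passes through (Xbar, Ybar), so the intercept can be
# recovered from the slope and the two sample means.
intercept_from_means = y.mean() - slope * x.mean()

# The numbers from Question 1: slope 2, Xbar 1, Ybar 5.
quiz_intercept = 5 - 2 * 1

print(intercept, intercept_from_means, quiz_intercept)
```

The first two printed values should agree up to rounding error, because the least-squares line always passes through the point of means.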

Measuring the relationship between two variables


EstimatingRegressionCoefficients.2.RelationshipBetween2Vars.mp4

Question 2: If you calculate that the sample correlation between income (in dollars) and age (in years) is 0.5 (making this up), and then you decide to report income in terms of thousands of dollars and recalculate, what will the sample correlation be equal to?

Show answer

0.5. The sample correlation does not change if you multiply one of the variables by a positive constant.
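A quick way to convince yourself is to simulate some data and rescale it. This is only a sketch: the ages, incomes, and noise level below are invented, and the correlation in this fake data will not be 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: ages in years and incomes in dollars.
age = rng.uniform(20, 65, size=200)
income_dollars = 20000 + 800 * age + rng.normal(0, 15000, size=200)

# Same incomes, reported in thousands of dollars.
income_thousands = income_dollars / 1000

r_dollars = np.corrcoef(age, income_dollars)[0, 1]
r_thousands = np.corrcoef(age, income_thousands)[0, 1]

print(r_dollars, r_thousands)
```

The two correlations should match up to rounding error, because rescaling income changes its standard deviation and its covariance with age by the same factor.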

Using correlation to estimate slope


EstimatingRegressionCoefficients.3.CorrelationToEstimateSlope.mp4

Question 3: What is the estimated slope of the line, as described at the end of the video?

Show answer

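The estimated slope is the correlation between X and Y multiplied by the ratio of the standard deviations, SD(Y)/SD(X). If the standard deviations of X and Y are identical, then they cancel, and the estimated slope of the regression line is exactly the correlation between X and Y.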
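The relationship slope = correlation * SD(Y) / SD(X) can be checked against a least-squares fit. The sketch below uses made-up data and assumes sample standard deviations (`ddof=1`); the ratio is the same with either divisor as long as you use the same one for X and Y.

```python
import numpy as np

# Made-up data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.8, 5.1, 7.4, 7.9])

r = np.corrcoef(x, y)[0, 1]

# Slope as a rescaled correlation.
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

# Slope from a direct least-squares fit.
slope_least_squares = np.polyfit(x, y, deg=1)[0]

print(slope_from_r, slope_least_squares)
```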

Justifying slope estimate as standardized correlation


EstimatingRegressionCoefficients.4.JustifyingSlopeEstimate.mp4

Question 4: If you estimate that the slope of the linear regression line connecting mean income (in dollars) and age (in years) is 1000 (making this up), and then you decide to report income in terms of thousands of dollars and recalculate, what will the new estimated slope be equal to?

Show answer

The answer is 1. You decided to divide the Y variable by 1000. Therefore, the slope of the line will be reduced by a factor of 1000.

There are several more questions about the relationship between slope and correlation toward the end of this module, but for now I want to move on.

In this module, I am giving you several ways to understand why we calculate the estimated slope the way we do. Above, you learned about the first way to understand: slope is a standardized version of correlation. Below, you will learn about several more ways to understand how we justify the slope estimator. The videos below were recorded previously, and I had presented correlation as a justification last instead of first. So, if I am counting the different ways to understand, and I seem to have forgotten that we already talked about correlation, that is why!

Residuals

EstimatingRegressionCoeff.5.Residuals.mp4

Question 5: Which is bigger in size, on average: the residual of a data point under a separate means model, or the residual of the same data point under an equal means model?

Show answer

The residual from a separate means model will be smaller, on average. In general, points are closer to the means in their own groups than to the overall mean, because each data point has a bigger influence on the mean in their own little subgroup than on the overall mean.
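Here is a small sketch comparing the two sets of residuals. The three groups and their values are invented, and I summarize the size of the residuals by their sum of squares, which is one reasonable choice among several.

```python
import numpy as np

# Invented data: three groups with clearly different centers.
groups = {
    "a": np.array([4.0, 5.0, 6.0]),
    "b": np.array([9.0, 10.0, 11.0]),
    "c": np.array([14.0, 16.0, 15.0]),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# Equal means model: every point is compared with the overall mean.
sse_equal = np.sum((all_values - grand_mean) ** 2)

# Separate means model: every point is compared with its own group mean.
sse_separate = sum(np.sum((v - v.mean()) ** 2) for v in groups.values())

print(sse_equal, sse_separate)
```

The separate means sum of squares can never exceed the equal means one, because each group mean is the value that minimizes the squared residuals within its own group.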

Justifying slope estimate by minimizing sums of squared residuals

EstimatingRegressionCoeff.6.MinimizingSums.mp4

Question 6: If the slope and intercept are selected to minimize the sum of squared residuals, which points can have a large influence on the choice of line? Describe the most influential points in a few words.

Show answer

Points with X values far from the mean of X, and points with Y values that deviate from the general pattern of the data. Choosing the regression line by minimizing the sum of the squared residuals leads to slope and intercept estimates that are heavily influenced by outliers. There are various types of outliers. Points with extreme X values will have a lot of influence, because a small change in slope leads to a large change in squared residual for these points. Points with Y values that are far from the line have a lot of influence because their residuals will be large. Note, though, that points that are extreme on X but follow the general line may not change the slope, and points that are extreme on Y but have X values near the mean of X may not change the slope. To see this, draw a picture and imagine where the line might go.
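If drawing the picture is not convincing enough, the sketch below moves one point off a perfect line and refits. The data are invented: nine points exactly on a line with slope 2, with one Y value then shifted up by 10.

```python
import numpy as np

def fitted_slope(x, y):
    """Least-squares slope of the regression of y on x."""
    return np.polyfit(x, y, deg=1)[0]

# Invented data: X = 1, ..., 9 (mean 5), all points exactly on Y = 2 * X.
x = np.arange(1.0, 10.0)
y = 2.0 * x

# Shift Y up by 10 at the most extreme X value (X = 9).
y_extreme = y.copy()
y_extreme[-1] += 10

# Shift Y up by 10 at the mean of X (X = 5).
y_center = y.copy()
y_center[4] += 10

slope_original = fitted_slope(x, y)
slope_extreme = fitted_slope(x, y_extreme)
slope_center = fitted_slope(x, y_center)

print(slope_original, slope_extreme, slope_center)
```

The same shift in Y pulls the slope noticeably when it happens at an extreme X value, and leaves the slope unchanged (only the intercept moves) when it happens at the mean of X.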

Justifying slope estimate by noticing that it is unbiased 

EstimatingRegressionCoeff.7.Unbiased.mp4

Question 7: Consider the following simulation. We choose the following model for the mean: Mean(Y|X) = 2+3*X. We set the sample size to n=3, with X values 1, 2, and 3. Then, repeatedly, we simulate the values of Y for these three data points by calculating Mean(Y|X) and adding (standard normal, independent) noise. For each simulated set of Y's, we estimate the slope of the regression of Y on X. We do this 1000 times, generating 1000 estimated slopes. What do we expect to see as the mean of these 1000 estimated slopes?

Show answer

The answer is 3, which is the true slope of the line. This is the definition of unbiasedness: the average of our estimates over all possible data sets drawn from a certain model is equal to the true slope in the model.
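The simulation in Question 7 is short enough to run yourself. This is one possible sketch in Python; the random seed and the vectorized layout are my choices, and the exact mean you get will vary with the seed.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed

x = np.array([1.0, 2.0, 3.0])
n_sims = 1000

# Mean(Y|X) = 2 + 3*X, plus independent standard normal noise.
mean_y = 2 + 3 * x
y = mean_y + rng.standard_normal((n_sims, x.size))  # one simulated data set per row

# Least-squares slope for every row at once:
# sum((Xi - Xbar) * Yi) / sum((Xi - Xbar)^2).
# Centering Y is unnecessary because the centered X values sum to zero.
x_centered = x - x.mean()
slopes = y @ x_centered / np.sum(x_centered ** 2)

mean_slope = slopes.mean()
print(mean_slope)
```

The printed mean should land close to 3, though not exactly on it, since 1000 simulations only approximate the average over all possible data sets.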

Justifying slope estimate by noting that it is an MLE

EstimatingRegressionCoeff.8.MLE.mp4

Question 8: What is "maximum likelihood estimation"? Try to describe it in just a few words.

Show answer

Our estimates are equal to the values that make the data most likely.
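To make this concrete, the sketch below writes down a log-likelihood and checks that the least-squares estimates beat nearby alternatives. It assumes independent normal errors with a known standard deviation of 1, which is my simplification for illustration; the data and the perturbations are made up.

```python
import numpy as np

def log_likelihood(intercept, slope, x, y, sigma=1.0):
    """Normal log-likelihood of the data given a line (sigma assumed known)."""
    residuals = y - (intercept + slope * x)
    n = len(y)
    return (-0.5 * n * np.log(2 * np.pi * sigma ** 2)
            - np.sum(residuals ** 2) / (2 * sigma ** 2))

# Made-up data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.8, 8.3, 10.9, 14.2, 16.7])

slope_hat, intercept_hat = np.polyfit(x, y, deg=1)
ll_best = log_likelihood(intercept_hat, slope_hat, x, y)

# Nudge the estimates in a few directions and recompute the log-likelihood.
nudges = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.1), (0.0, -0.1), (-0.2, 0.05)]
ll_nudged = [
    log_likelihood(intercept_hat + di, slope_hat + ds, x, y)
    for di, ds in nudges
]

print(ll_best, max(ll_nudged))
```

Under this normal model, maximizing the likelihood over the intercept and slope is the same as minimizing the sum of squared residuals, which is why every nudged line has a lower log-likelihood.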

My favorite way to justify the slope estimate - a weighted average of slopes

EstimatingRegressionCoeff.9.WeightedAvgSlope.mp4

Question 9: Why would we multiply and divide by (Xi-Xbar)?

Show answer

If we rearrange the terms, this expression will now relate to the equation for the slope of a line.
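Written out, the rearrangement looks roughly like the following. This is my reconstruction of the standard algebra, not a transcription of the video, and the weight symbol w_i is notation I am introducing here. It applies to points with X_i different from Xbar; a point with X_i equal to Xbar contributes nothing to the numerator.

```latex
\hat{\beta}_1
  = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_j (X_j - \bar{X})^2}
  = \sum_i \frac{(X_i - \bar{X})^2}{\sum_j (X_j - \bar{X})^2}
      \cdot \frac{Y_i - \bar{Y}}{X_i - \bar{X}}
  = \sum_i w_i \, \frac{Y_i - \bar{Y}}{X_i - \bar{X}},
\qquad
w_i = \frac{(X_i - \bar{X})^2}{\sum_j (X_j - \bar{X})^2}.
```

Each ratio (Y_i - Ybar)/(X_i - Xbar) is the slope of the line from the mean point to data point i, and the weights w_i are nonnegative and sum to 1.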

More on the slope-to-center interpretation

EstimatingRegressionCoeff.10.SlopeToCenter.mp4

Question 10: What would happen if we calculated the slope between each point and the mean point and then took the unweighted average of these slopes?

Show answer

The new estimate would be less influenced by points with extreme values of X.

The usual estimate of the slope puts a lot of weight on the points with extreme values of X. If we weight each point equally, the points with extreme values of X will be less influential.
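The sketch below computes both averages on made-up data. The weights, proportional to (Xi-Xbar) squared, are the ones that recover the usual least-squares slope; the X values are chosen so that no point sits exactly at Xbar, where the slope to the mean point would be undefined.

```python
import numpy as np

# Made-up data; X has mean 5 and no value equal to 5.
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
y = np.array([2.0, 3.5, 4.0, 8.0, 12.5])

dx = x - x.mean()
dy = y - y.mean()

# Slope of the line from the mean point (Xbar, Ybar) to each data point.
point_slopes = dy / dx

# Weights proportional to squared distance from Xbar; they sum to 1.
weights = dx ** 2 / np.sum(dx ** 2)

weighted_average = np.sum(weights * point_slopes)
unweighted_average = point_slopes.mean()
least_squares_slope = np.polyfit(x, y, deg=1)[0]

print(weights)
print(weighted_average, least_squares_slope, unweighted_average)
```

The weighted average should match the least-squares slope, and the printed weights show how much of the total goes to the point at X = 11.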

Now we will return to the earlier interpretation and ponder some questions about slope and correlation, to make sure you leave this lecture solid on this concept.

Question 11: X is height, measured in inches, and Y is weight, measured in pounds. Given a data set, we estimate correlation=.6 and slope=5. How do you interpret the slope?

Show answer

For each 1 inch increase in height, the estimated mean weight increases by 5 pounds.

Question 12: Continuing the previous problem. Suppose that we decide to record the weights in ounces instead of pounds. There are 16 ounces in a pound. Now what is the correlation? Now what is the slope?

Show answer

0.6 and 80. For each 1 inch increase in height, weight increases by 5*16=80 ounces. Make sure it makes sense to you that we multiplied by 16: when height goes up by an inch, the weight goes up 5 pounds, which is 80 ounces. Correlation does not change.

Question 13: Again continuing the same problem. Now we decide to record weights in terms of standard deviations from the mean weight. The standard deviation of the weight, in pounds, is 33.5. Now what is the correlation? Now what is the slope?

Show answer

0.6 and 0.15. For each 1 inch increase in height, weight increases by 5/33.5=.15 standard deviations. Make sure it makes sense to you that we divided by 33.5: when height increases by 1 inch, we gain 5 pounds, which is 0.15 of a standard deviation. The number of standard deviations gained should be less than the number of pounds gained because there are 33.5 pounds in 1 standard deviation.

Question 14: Last question about the same problem. Continuing to record weight in terms of standard deviations, we decide to also record height in terms of standard deviations. The standard deviation of heights, in inches, is 4. Now what is the correlation? Now what is the slope?

Show answer

0.6 and 0.6. For each 1 standard deviation increase in height, weight increases by .15*4 = 0.6 standard deviations. Make sure it makes sense to you that we multiplied by 4: when height increases by 1 inch, weight increases by 0.15 standard deviations, so when height increases by 4 inches, weight increases by 0.15*4 standard deviations. The slope of the standardized values is equal to the correlation. In other words, correlation is just slope where X and Y are measured in terms of standard deviation.
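The unit conversions in Questions 12 through 14 can be laid out as plain arithmetic. The sketch below uses only the numbers given in the questions; the variable names are mine.

```python
# Numbers from Questions 11 through 14.
slope_lb_per_inch = 5.0
sd_weight_lb = 33.5
sd_height_inch = 4.0
ounces_per_pound = 16

# Question 12: weight in ounces, height in inches.
slope_oz_per_inch = slope_lb_per_inch * ounces_per_pound

# Question 13: weight in standard deviations, height in inches.
slope_sd_per_inch = slope_lb_per_inch / sd_weight_lb

# Question 14: weight and height both in standard deviations.
slope_sd_per_sd = slope_sd_per_inch * sd_height_inch

print(slope_oz_per_inch)
print(round(slope_sd_per_inch, 2))
print(round(slope_sd_per_sd, 1))
```

Without rounding, the last value is slightly below 0.6, because the slope of 5 and the standard deviations in the questions are themselves rounded.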

Question 15 (Review from previous module as a reminder): Which are the assumptions underlying regression/ANOVA? Check all that apply.

Show answer

All four should be checked. The best way to check the independence assumption is by thinking about the context. The best way to check the other assumptions is graphically - no hypothesis tests needed to check assumptions.

That's all, folks!