Multiple Regression

The principles of simple linear regression lay the foundation for more sophisticated regression methods used in a wide range of challenging settings. In this section, we explore multiple regression, which allows us to conduct regression analyses with more than one predictor in the model.

Multiple regression extends simple two-variable regression in that the analysis still requires only one response but, unlike simple regression, allows for many predictors (denoted x1, x2, x3, …). The method is motivated by scenarios where many variables may be simultaneously connected to an outcome.

We will consider eBay auctions of a video game called Mario Kart for the Nintendo Wii. The outcome variable of interest is the total price of an auction, which is the highest bid plus the shipping cost. We will try to determine how total price is related to each characteristic in an auction while simultaneously controlling for other variables. For instance, all other characteristics held constant, are longer auctions associated with higher or lower prices? And, on average, how much more do buyers tend to pay for additional Wii wheels (plastic steering wheels that attach to the Wii controller) in auctions? Multiple regression will help us answer these and other questions.

The data set mario_kart includes results from 141 auctions. Four observations from this data set are shown in Table 1, and descriptions of each variable are given in Table 2. Notice that cond_new and stock_photo are categorical variables coded as 0 or 1. For instance, the cond_new variable takes value 1 if the game up for auction is new and 0 if it is used. Using indicator variables in place of category names allows these variables to be used directly in regression. Multiple regression also accommodates categorical variables with many levels, though we have no such variables in this analysis and do not cover those details in this course.
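As a small illustration of this coding, a two-level categorical variable can be turned into a 0/1 indicator. The condition labels below are invented for illustration; they are not taken from the mario_kart records:

```python
# Sketch: turning a two-level categorical variable into a 0/1 indicator.
# The labels here are hypothetical, not actual mario_kart observations.
conditions = ["new", "used", "used", "new"]
cond_new = [1 if c == "new" else 0 for c in conditions]
print(cond_new)  # [1, 0, 0, 1]
```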


A single-variable model for the Mario Kart data

Let’s fit a linear regression model with the game’s condition as a predictor of auction price. The model may be written as

price = β0 + β1 × cond_new

Results of this model are shown in Table 3, along with a scatterplot of price versus game condition.

Scatterplot of the total auction price against the game’s condition. The least squares line is also shown.
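The least squares fit for a single 0/1 predictor can be sketched directly. The prices below are made up for illustration (the real auction prices are not reproduced here); the key fact the sketch shows is that with a binary predictor, the slope equals the difference between the two group means:

```python
import numpy as np

# Sketch: fit price = b0 + b1 * cond_new by least squares on invented
# prices (the actual mario_kart data are not reproduced here).
cond_new = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 1.0])
price = np.array([55.0, 53.0, 42.0, 44.0, 46.0, 54.0])

X = np.column_stack([np.ones_like(cond_new), cond_new])  # intercept + indicator
b0, b1 = np.linalg.lstsq(X, price, rcond=None)[0]

# With a single 0/1 predictor, b0 is the mean price of used games and
# b1 is the mean price difference between new and used games.
print(round(b0, 2), round(b1, 2))  # 44.0 10.0
```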

PRACTICE 1

Examine the scatterplot for the Mario Kart data set.

  1. Does the linear model seem reasonable?

  2. Interpret the coefficient for the game's condition in the model.

Including and assessing many variables in a model

Sometimes there are underlying structures or relationships between predictor variables. For instance, new games sold on eBay tend to come with more Wii wheels, which may have led to higher prices for those auctions. We would like to fit a model that includes all potentially important variables simultaneously. This would help us evaluate the relationship between a predictor variable and the outcome while controlling for the potential influence of other variables. This is the strategy used in multiple regression. While we remain cautious about making any causal interpretations using multiple regression, such models are a common first step in providing evidence of a causal connection.

We want to construct a model that accounts for not only the game condition, as in the mario_kart example, but simultaneously accounts for three other variables: stock photo, duration, and wheels.


y = β0 + β1x1 + β2x2 + β3x3 + β4x4

In this equation, y represents the total price, x1 indicates whether the game is new, x2 indicates whether a stock photo was used, x3 is the duration of the auction, and x4 is the number of Wii wheels included with the game. Just as with the single predictor case, a multiple regression model may be missing important components or it might not precisely represent the relationship between the outcome and the available explanatory variables. While no model is perfect, we wish to explore the possibility that this one may fit the data reasonably well.

We estimate the parameters β0, β1, …, β4 in the same way as we did in the case of a single predictor. We select b0, b1, …, b4 that minimize the sum of the squared residuals:

SSE = e1² + e2² + ⋯ + e141², where ei = yi − ŷi is the residual for observation i

Here there are 141 residuals, one for each observation. We typically use a computer to minimize the SSE and compute point estimates, as shown in the sample output in the table below. Using this output, we identify the point estimates bi of each βi, just as we did in the one-predictor case.
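A minimal sketch of this minimization on synthetic data (the generating coefficients below are invented for illustration and are not the actual auction fit):

```python
import numpy as np

# Sketch: minimize SSE for a model with four predictors on synthetic data.
rng = np.random.default_rng(0)
n = 141
x1 = rng.integers(0, 2, n).astype(float)   # cond_new indicator
x2 = rng.integers(0, 2, n).astype(float)   # stock_photo indicator
x3 = rng.uniform(1.0, 10.0, n)             # auction duration (days)
x4 = rng.integers(0, 5, n).astype(float)   # number of Wii wheels
# Invented coefficients, used only to generate illustrative data:
y = 36.0 + 5.0*x1 + 1.0*x2 - 0.1*x3 + 7.0*x4 + rng.normal(0.0, 2.0, n)

X = np.column_stack([np.ones(n), x1, x2, x3, x4])
b = np.linalg.lstsq(X, y, rcond=None)[0]   # point estimates b0, ..., b4
sse = float(np.sum((y - X @ b) ** 2))      # the minimized SSE

# Any other choice of coefficients yields a larger SSE:
b_alt = b + np.array([0.0, 0.5, 0.0, 0.0, 0.0])
assert sse < float(np.sum((y - X @ b_alt) ** 2))
```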

PRACTICE 2

Write out the model

price = β0 + β1 × cond_new + β2 × stock_photo + β3 × duration + β4 × wheels

equivalently written as

y = β0 + β1x1 + β2x2 + β3x3 + β4x4

using the point estimates from the "output for the regression model where price is the outcome and cond_new, stock_photo, duration, and wheels are the predictors" table.

  1. How many predictors are in this model?

  2. What does β4, the coefficient of variable x4 (Wii wheels), represent? What is the point estimate of β4?

Answers

Practice 1.

  1. Yes. Constant variability, nearly normal residuals, and linearity all appear reasonable.

  2. Note that cond_new is a two-level categorical variable that takes value 1 when the game is new and value 0 when the game is used. The coefficient 10.90 therefore means the model predicts an extra $10.90, on average, for games that are new versus those that are used.

Practice 2.

  1. ŷ = 36.21 + 5.13x1 + 1.08x2 − 0.03x3 + 7.29x4

     There are k = 4 predictor variables.

  2. It is the average difference in auction price for each additional Wii wheel included, holding the other variables constant. The point estimate is b4 = 7.29.
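Once the point estimates are in hand, the fitted equation can be used for prediction. A sketch using the coefficients from Practice 2 on a hypothetical auction (the input values are chosen purely for illustration):

```python
# Prediction from the fitted model in Practice 2:
# y-hat = 36.21 + 5.13*x1 + 1.08*x2 - 0.03*x3 + 7.29*x4
def predict_price(cond_new, stock_photo, duration, wheels):
    return 36.21 + 5.13*cond_new + 1.08*stock_photo - 0.03*duration + 7.29*wheels

# Hypothetical auction: new game, stock photo, 7-day duration, 2 Wii wheels.
print(round(predict_price(1, 1, 7, 2), 2))  # 56.79
```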


References:

  1. https://courses.lumenlearning.com/introstats1/chapter/introduction-multiple-and-logistic-regression/
