Simple Linear Regression

We will use the Simple Linear Regression dataset available on Kaggle for this exercise. The dataset has 84 observations of two variables: SAT score and GPA.

We can predict a student's GPA from their SAT Score using a simple linear regression model. In this case, GPA is the response variable, whereas SAT Score is the predictor variable.
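For reference, the model can be written in the usual textbook form. The symbols for the intercept, slope, and error term below are conventional notation assumed here; they do not come from the dataset or the R output:

```latex
\[
\mathrm{GPA}_i = \beta_0 + \beta_1\,\mathrm{SAT}_i + \varepsilon_i, \qquad i = 1, \dots, n
\]
```

Here \beta_0 is the intercept, \beta_1 is the slope, and \varepsilon_i is the random error for student i.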

Here is the R Code:

> slr<-read.csv("C:/Users/user/Desktop/MathFlurry/1.01. Simple linear regression.csv")

> head(slr)

  ï..SAT  GPA
1   1714 2.40
2   1664 2.52
3   1760 2.54
4   1685 2.74
5   1693 2.83
6   1670 2.91

> str(slr)

'data.frame': 84 obs. of 2 variables:

$ ï..SAT: int 1714 1664 1760 1685 1693 1670 1764 1764 1792 1850 ...

$ GPA : num 2.4 2.52 2.54 2.74 2.83 2.91 3 3 3.01 3.01 ...
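The ï.. prefix on the SAT column name is most likely not part of the data. It is what R typically produces when a CSV file saved with a UTF-8 byte-order mark (BOM) is read without declaring the encoding. The sketch below shows the usual workaround. It builds a small temporary file (a hypothetical stand-in for the Kaggle CSV, using the first two rows shown above) and reads it with fileEncoding = "UTF-8-BOM":

```r
# Write a tiny CSV that starts with a UTF-8 byte-order mark (EF BB BF).
# This file is a stand-in for illustration, not the real dataset.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("SAT,GPA\n1714,2.40\n1664,2.52\n"), con)
close(con)

# Declaring the encoding lets R strip the BOM before parsing the header
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(demo)
```

Adding the same argument to the read.csv() call above should give a column named SAT. The rest of this walkthrough keeps the column name exactly as R printed it.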

> plot(slr$ï..SAT,slr$GPA,main="Scatter Plot of GPA vs SAT",xlab="SAT",ylab="GPA",xaxs="i",yaxs="i",xlim=c(0,2200),ylim=c(0,4))#plots the scatter plot in Fig. 1

Fig. 1: Scatter Plot of GPA vs SAT.

From Fig. 1, we can see that GPA, the response variable, tends to increase as SAT, the predictor variable, increases, and that the increase appears to be roughly linear. To quantify this trend, we fit a simple linear regression.

R Code:

> model_1 <- lm(GPA~ï..SAT, data = slr)

> summary(model_1)


Call:

lm(formula = GPA ~ ï..SAT, data = slr)


Residuals:

     Min       1Q   Median       3Q      Max
-0.71289 -0.12825  0.03256  0.11660  0.43957


Coefficients:

             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2750403  0.4087394   0.673    0.503
ï..SAT      0.0016557  0.0002212   7.487  7.2e-11 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Residual standard error: 0.2106 on 82 degrees of freedom

Multiple R-squared: 0.406, Adjusted R-squared: 0.3988

F-statistic: 56.05 on 1 and 82 DF, p-value: 7.2e-11

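A regression line is the straight line that best describes the relationship between a predictor and a response variable. The lm() function chooses it by least squares, that is, by minimizing the sum of squared vertical distances between the observed and fitted GPA values.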

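From the R regression output above, we can see that the intercept of the regression line is 0.275 and the slope is 0.00166 (the intercept rounded to three decimal places, the slope to three significant figures). Thus, the equation of the regression line is GPA = 0.275 + 0.00166 * SAT. In other words, each additional 100 SAT points is associated with an increase of about 0.17 in predicted GPA. Rounding the slope further to 0.002 would be too coarse, because SAT values are in the thousands and the rounding error is multiplied accordingly.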
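The fitted equation can be turned into a small prediction helper. The sketch below copies the estimates from the Coefficients table. The function name predict_gpa and the SAT value of 1700 are hypothetical, chosen only for illustration:

```r
# Coefficient estimates copied from the summary(model_1) output above
b0 <- 0.2750403  # intercept
b1 <- 0.0016557  # slope: change in predicted GPA per SAT point

# Hypothetical helper (not part of the original code)
predict_gpa <- function(sat) b0 + b1 * sat

predict_gpa(1700)      # about 3.09
0.275 + 0.002 * 1700   # 3.675: a slope rounded to 0.002 overshoots
```

With the fitted model object available, coef(model_1) returns the same estimates at full precision, so the values need not be typed by hand.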

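The "Coefficients" section of the R output above provides estimates, standard errors, t values, and p values for the intercept and slope of the regression line. Each row contains information for testing the null hypothesis that the value of a particular coefficient is zero. Here the p value for the slope (7.2e-11) is far below 0.05, so there is strong evidence of a linear association between SAT and GPA, whereas the p value for the intercept (0.503) gives no evidence that the intercept differs from zero.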
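The t values and p values in that table can be recomputed from the estimates and standard errors. The sketch below does so from the printed figures, assuming the standard two-sided t test on the residual degrees of freedom. The variable names are illustrative only:

```r
# Values copied from the Coefficients table above
est <- c(intercept = 0.2750403, slope = 0.0016557)
se  <- c(intercept = 0.4087394, slope = 0.0002212)
df  <- 82  # residual degrees of freedom: 84 observations - 2 coefficients

# t value = estimate / standard error
t_val <- est / se

# Two-sided p value from the t distribution with 82 degrees of freedom
p_val <- 2 * pt(abs(t_val), df = df, lower.tail = FALSE)

round(t_val, 2)
signif(p_val, 2)
```

Because the inputs are the rounded figures R printed, the results may differ slightly in the last digit from the table. For a simple linear regression, the square of the slope's t value also matches the F-statistic up to the same rounding.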

We can also overlay the fitted regression line on the scatter plot, as shown in Fig. 2.

R Code:

> plot(slr$ï..SAT,slr$GPA,ylab="GPA",xaxs="i",yaxs="i",xlim=c(0,2200),ylim=c(0,4))

> abline(model_1,col="blue")#adds the regression line from model_1 to the scatter plot


Fig. 2: Regression Line on Scatter Plot.