Linear regression is frequently the "go-to", most practical approach to modelling the relationship between two or more variables. When we are convinced that a cause-and-effect relationship can be observed between variables, we designate a dependent variable and link it to one or more explanatory variables (or independent variables). Below, for example, we believe that the number of students per teacher influences test scores. The case of one explanatory variable is referred to as simple linear regression; for more than one explanatory variable, the process is referred to as multiple linear regression. Here, using Excel and R, I set up a simple regression. Later, I will use an OLS model to measure credit risk and also to interpolate asset volatility. Please refer to https://www.econometrics-with-r.org/4-lrwor.html . Introduction to Econometrics with R is one of the more practical combinations of worked examples and econometric analysis, and I strongly recommend it. Please use the link to access the spreadsheet workings. The text is free but, more importantly, exploits interactive learning that blends R code with examples from the celebrated Stock & Watson (2015). Set up as an empirical companion, the interactive script permits a reproducible, research-report style and enables students not only to learn how the results of case studies can be replicated with R but also to strengthen their ability to apply the newly acquired skills in other empirical applications. In the video clips below, I break apart some basic examples in Excel and then follow Christoph Hanck, Martin Arnold, Alexander Gerber and Martin Schmelzer for the rest.
The following R code was used to output the Ordinary Least Squares parameter estimates. Compare the confidence intervals with the spreadsheet.
# Chapter 5 Introduction to Econometrics with R, but with
# simple regression
# Numbers changed for hypothesis testing
# https://www.econometrics-with-r.org/4-lrwor.html

# Create sample data
STR <- c(15, 17, 19, 20, 22, 23.5, 25)
Testscore <- c(680, 640, 670, 660, 630, 660, 635)

# Print out sample data
STR
Testscore

# create a scatterplot of the data
plot(Testscore ~ STR)

# estimate the model and assign the result to linear_model
linear_model <- lm(Testscore ~ STR)

# print the standard output of the estimated lm object to the console
linear_model
summary(linear_model)

plot(Testscore ~ STR,
     main = "Scatterplot of Testscore and STR",
     xlab = "STR (X)",
     ylab = "Testscore (Y)",
     xlim = c(10, 30),
     ylim = c(600, 720))

# add the regression line
abline(linear_model)

summary(linear_model)$coef
residuals(linear_model)
linear_model$df.residual

# beta_1 ~ t_5: p-value for a two-sided significance test
2 * pt(-2.968015 / 1.965646, df = 5)

# not close to the normal approximation
2 * pnorm(-2.968015 / 1.965646)

# compute 95% confidence interval for coefficients in 'linear_model'
confint(linear_model)

# compute 95% confidence interval for coefficients in 'linear_model' by hand
lm_summ <- summary(linear_model)
c("lower" = lm_summ$coef[2, 1] - qt(0.975, df = lm_summ$df[2]) * lm_summ$coef[2, 2],
  "upper" = lm_summ$coef[2, 1] + qt(0.975, df = lm_summ$df[2]) * lm_summ$coef[2, 2])

The R code below provides a manual working of key metrics, including SSR, TSS and ESS.
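The slope and intercept reported by lm() can also be checked against the textbook OLS formulas. A minimal sketch using the same sample data, where beta_1 = cov(X, Y) / var(X) and beta_0 = mean(Y) - beta_1 * mean(X):

```r
# same sample data as in the listing above
STR <- c(15, 17, 19, 20, 22, 23.5, 25)
Testscore <- c(680, 640, 670, 660, 630, 660, 635)
linear_model <- lm(Testscore ~ STR)

# OLS estimates by hand
beta_1 <- cov(STR, Testscore) / var(STR)              # slope
beta_0 <- mean(Testscore) - beta_1 * mean(STR)        # intercept

# agrees with the coefficients from lm()
all.equal(unname(coef(linear_model)), c(beta_0, beta_1))
```

The same formulas are what the spreadsheet's SLOPE() and INTERCEPT() functions compute, so all three routes should agree.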
# Chapter 4 Introduction to Econometrics with R
# https://www.econometrics-with-r.org/4-lrwor.html

# Create sample data
STR <- c(15, 17, 19, 20, 22, 23.5, 25)
Testscore <- c(680, 640, 670, 660, 630, 660, 635)

# Print out sample data
STR
Testscore

# create a scatterplot of the data
plot(Testscore ~ STR)

# estimate the model and assign the result to linear_model
linear_model <- lm(Testscore ~ STR)

# print the standard output of the estimated lm object to the console
linear_model
summary(linear_model)

plot(Testscore ~ STR,
     main = "Scatterplot of Testscore and STR",
     xlab = "STR (X)",
     ylab = "Testscore (Y)",
     xlim = c(10, 30),
     ylim = c(600, 720))

# add the regression line
abline(linear_model)

summary(linear_model)$coef
residuals(linear_model)
anova(linear_model)

# Manual estimation
# define the components
n <- 7  # number of observations (rows)
k <- 1  # number of regressors
y_mean <- mean(Testscore)  # mean of test scores

SSR <- sum(residuals(linear_model)^2)          # sum of squared residuals
TSS <- sum((Testscore - y_mean)^2)             # total sum of squares
ESS <- sum((fitted(linear_model) - y_mean)^2)  # explained sum of squares

# compute the measures
SER <- sqrt(1 / (n - k - 1) * SSR)                 # standard error of the regression
Rsq <- 1 - (SSR / TSS)                             # R^2
adj_Rsq <- 1 - (n - 1) / (n - k - 1) * SSR / TSS   # adj. R^2

# print the measures to the console
c("SER" = SER, "R2" = Rsq, "Adj.R2" = adj_Rsq)

ANOVA is important for assessing the explanatory power of any model. Here, I take a slightly closer look:
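The F statistic that anova() reports can be reproduced from the same components, since F = (ESS / k) / (SSR / (n - k - 1)) for the overall regression. A short sketch, reusing the data and variable names from the listing above:

```r
# same sample data as in the listing above
STR <- c(15, 17, 19, 20, 22, 23.5, 25)
Testscore <- c(680, 640, 670, 660, 630, 660, 635)
linear_model <- lm(Testscore ~ STR)

n <- 7  # number of observations
k <- 1  # number of regressors
SSR <- sum(residuals(linear_model)^2)                     # sum of squared residuals
ESS <- sum((fitted(linear_model) - mean(Testscore))^2)    # explained sum of squares

# F = (explained variance per regressor) / (residual variance per df)
F_manual <- (ESS / k) / (SSR / (n - k - 1))
F_anova  <- anova(linear_model)[1, "F value"]
all.equal(F_manual, F_anova)
```

This is the same F statistic printed at the foot of summary(linear_model), which tests the null hypothesis that all slope coefficients are zero.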
t-values are used to gauge the statistical significance of individual explanatory variables. The F-test lets you compare two competing regression models in their capacity to "explain" the variance in the dependent variable. The F-test is mainly employed in ANOVA to assess the overall explanatory power of the model generated from the regression analysis.
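To illustrate the model-comparison use of the F-test, the sketch below fits a second, larger model and lets anova() carry out the F-test of the restriction. The quadratic term I(STR^2) is purely illustrative (it is not part of the example above), but it shows the mechanics of comparing two nested models:

```r
# same sample data as in the listings above
STR <- c(15, 17, 19, 20, 22, 23.5, 25)
Testscore <- c(680, 640, 670, 660, 630, 660, 635)

restricted <- lm(Testscore ~ STR)             # the simple model used above
full       <- lm(Testscore ~ STR + I(STR^2))  # adds an illustrative quadratic term

# anova() on two nested models performs the F-test of the restriction:
# does the extra regressor significantly reduce the residual sum of squares?
anova(restricted, full)
```

With such a small sample the added term is unlikely to be significant; the point is only the workflow, which carries over directly to larger multiple-regression models.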