The goal of this lab activity is to start putting your understanding of R to analytical work as quickly as possible. Once you know how to build and use atomic vectors, you can start using some of R's statistical tools to understand regression techniques. On our path to eventually understanding how to propagate uncertainty through non-linear regression analyses, our first step is to review linear regression and the nature of the uncertainties reported by linear regression tools. The statistical goals of non-linear regression are exactly the same as those of linear regression, in that you are trying to find the model parameter values that minimize the error in fitting the model to observations. The only differences are the nature of the model and the nature of the algorithm necessary to find the optimized parameter values. Our path starts with a review of some of the basic statistics we will use heavily throughout this journey through linear and non-linear regression.
First, let's review the mean and standard deviation (or variance) summary statistics (4:13 min).
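For reference between videos, here is a minimal sketch (not code from the video) of how these summary statistics are computed in R; the sample values are arbitrary placeholders.

    # Arbitrary example sample
    x <- c(4.1, 3.8, 5.2, 4.6, 4.9)

    mean(x)   # arithmetic mean
    var(x)    # sample variance (divides by n - 1)
    sd(x)     # sample standard deviation, equal to sqrt(var(x))

    # The same quantities computed "by hand" to confirm the definitions:
    sum(x) / length(x)                      # mean
    sum((x - mean(x))^2) / (length(x) - 1)  # variance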
The goals of linear regression and non-linear regression are identical, i.e., to find values for model parameters that minimize the residual error between a model and a set of observations. For a linear regression, the model is a straight line and the model parameters optimized are the slope and y-intercept of that line. A linear model simplifies the math necessary to find the optimized slope and intercept values, and a linear regression is relatively easy to visualize in terms of what changing the slope and intercept does to the model prediction. Therefore, let's review how a linear regression works before we move on to models that may be less easy to visualize (4:27 min).
One of the best methods to test your understanding of a statistical analysis is to build a data set with known character relative to the analysis being done, then do the statistical analysis on those artificial data to be sure you get the results you expect. Let's review conceptually how to do that with a linear regression. This is also the logical pathway to understanding how a Monte Carlo propagation of uncertainty works, so you might also think of this as a single realization of a Monte Carlo uncertainty analysis (6:40 min).
With these statistical concepts in hand, let's code an example of a single realization of a Monte Carlo analysis of linear regression in R!
First, the basics of coding a linear model (6:58 min).
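As a point of reference between videos, here is a minimal sketch of a linear model coded directly in R; the slope, intercept, and x values are arbitrary placeholders rather than the values used in the video.

    # Arbitrary placeholder parameter values
    slope <- 2.5
    intercept <- 1.0

    x <- 0:10                    # independent variable
    y <- slope * x + intercept   # model-predicted dependent variable
    y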
The character of the linear model is easier to see in a graph (3:32 min).
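One possible base R sketch of such a graph, using the same placeholder values as above; plot() with type = "l" draws the model prediction as a line.

    slope <- 2.5
    intercept <- 1.0
    x <- 0:10
    y <- slope * x + intercept

    plot(x, y, type = "l",
         xlab = "independent variable (x)",
         ylab = "dependent variable (y)",
         main = "Linear model: y = slope * x + intercept")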
A Monte Carlo propagation of error starts with adding synthetic error to a model based on the presumed nature of the error in measurement of the dependent variable. We can use the rnorm() function if we want to presume that error in the dependent variable is normally distributed (7:01 min).
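Here is a hedged sketch of that step, continuing the placeholder model above; the standard deviation passed to rnorm() is an arbitrary choice standing in for the presumed measurement error.

    set.seed(42)     # optional: makes the random draws reproducible
    slope <- 2.5
    intercept <- 1.0
    x <- 0:10
    y_true <- slope * x + intercept

    # One synthetic "observed" data set: the model plus normally distributed error
    noise <- rnorm(length(x), mean = 0, sd = 1.5)
    y_obs <- y_true + noise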
When adding more than one plot to the same axes in base R graphics, you will need to be sure the scale of the axis will allow all points to appear within the graphing region of the canvas (4:07 min).
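For example (with the setup repeated so the lines run on their own), ylim can be computed from the combined range of the model and the synthetic observations so nothing falls outside the plotting region.

    set.seed(42)
    x <- 0:10
    y_true <- 2.5 * x + 1.0
    y_obs <- y_true + rnorm(length(x), sd = 1.5)

    plot(x, y_true, type = "l",
         ylim = range(c(y_true, y_obs)),   # axis scale wide enough for both data sets
         xlab = "x", ylab = "y")
    points(x, y_obs, pch = 16)             # overlay the synthetic observations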
In order to complete a single iteration of a Monte Carlo analysis, we need to perform a linear regression on the synthetic data we created. Here is how to do that with the lm() function in R (4:45 min).
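A minimal sketch of that step, with the setup repeated so the block runs on its own; the formula y_obs ~ x reads "model y_obs as a linear function of x".

    set.seed(42)
    x <- 0:10
    y_obs <- 2.5 * x + 1.0 + rnorm(length(x), sd = 1.5)

    fit <- lm(y_obs ~ x)
    fit            # prints the fitted intercept and slope
    coef(fit)      # the optimized parameter values as a named vector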
The object returned by the lm() method is a data type we have not talked about yet: a "list". One of the elements of that list references another data type we have not talked about yet: a matrix. That matrix contains critical information about the uncertainty in the parameter value estimates from the regression. Ultimately these metrics of uncertainty are data that we want to be able to extract from the object returned by lm(). Let's return to the topic of fundamental R data types for a bit, to make sure we understand the nature of lists and matrices before digging deeper into this object.
Lists are a special kind of vector. But rather than each element containing an atomic type like a number or a character string, each element contains a reference to another object. This added level of abstraction or indirection is very useful in aggregating objects of different types. Therefore, lists are often the foundations for much more complex data structures (including the object returned by lm()). Because of this indirection, every element of a list-mode vector can be associated with an object of a different data type. I have seen many students burn a lot of time due to confusion about the behavior of lists vs. the behavior of atomic vectors, and it's worth understanding those differences in some detail. Here is a detailed review of the nature of the list data type in R (9:11).
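As a quick illustration of the indirection described above, here is a small sketch contrasting indexing of an atomic vector with single- and double-bracket indexing of a list (the values are arbitrary).

    v <- c(1, 2, 3)                    # atomic vector: every element holds a number
    l <- list(1, "two", c(3, 4, 5))    # list: each element refers to another object

    v[2]        # element of an atomic vector: a length-1 numeric vector
    l[2]        # single bracket on a list returns a smaller list
    l[[2]]      # double bracket extracts the referenced object itself ("two")
    l[[3]][1]   # the third element is itself a vector, which can be indexed further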
After making this video, I realized some visualizations might help illustrate the difference between an atomic vector and a list. Here are two slides to help clarify that an atomic vector has elements that contain data, while a list has elements that contain references to other objects. The first slide is a visualization of a list of atomic vectors of the same type (note there are animations to click through). The second slide is a visualization of a list of many different types of objects.
Because lists are just special kinds of vectors, the elements of lists can be named the same way that elements of atomic vectors can be named. Let's take this opportunity to understand in more detail how the names of the elements are stored as an attribute of the vector object. All R objects have the potential to have attributes associated with them, which are critical to how the information in the underlying data type is interpreted. Therefore, understanding how to examine and alter the attributes of an R object can be a critical step to understanding the causes of bugs not evident in the data themselves. (10:28).
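A brief sketch of how names live in the attributes of an object (the vector and its names here are arbitrary examples):

    v <- c(a = 1, b = 2, c = 3)
    names(v)          # the names attribute as a character vector
    attributes(v)     # all attributes of the object; here just $names
    v["b"]            # names allow indexing by label as well as by position

    # Names can also be assigned or changed after the fact:
    names(v) <- c("x", "y", "z")
    v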
The matrix data type will be our first example of an object that has a different "class" from the underlying R data type. The idea of a class is rooted in object-oriented principles, and it introduces the concept of inheritance in that more specific classes in R inherit the properties of the more general data types upon which they are built. For example, a matrix is a special kind of atomic vector, where the matrix inherits the properties of a general one-dimensional vector. But a matrix has additional attributes beyond an atomic vector that allow it to be interpreted more specifically as a two-dimensional table. Let's explore how that works and how it ultimately provides a useful tabular atomic data structure that can be indexed in two dimensions (15:06).
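A small sketch of this idea: the dim attribute is what turns an atomic vector into a matrix that can be indexed in two dimensions, while the underlying atomic data type is unchanged.

    m <- matrix(1:6, nrow = 2, ncol = 3)
    class(m)          # "matrix" (and "array" in recent versions of R)
    typeof(m)         # "integer": the underlying atomic type is unchanged
    attributes(m)     # the $dim attribute is what makes it two-dimensional
    m[2, 3]           # indexing in two dimensions: row 2, column 3
    m[5]              # one-dimensional indexing of the underlying vector still works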
With some more detailed understanding of the list and matrix classes, we are now ready to dive back into the data structure returned by the lm() linear regression function (11:18).
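As a preview of where that video ends up, here is a hedged sketch of pulling the fitted parameters and their standard errors out of an lm() object; the synthetic data are arbitrary, as in the earlier sketches.

    set.seed(42)
    x <- 0:10
    y_obs <- 2.5 * x + 1.0 + rnorm(length(x), sd = 1.5)
    fit <- lm(y_obs ~ x)

    is.list(fit)          # the fitted-model object is built on a list
    names(fit)            # the named elements of that list
    fit$coefficients      # one element: the fitted intercept and slope

    # summary() builds a coefficient table stored as a matrix, including the
    # standard errors of the parameter estimates:
    summary(fit)$coefficients
    summary(fit)$coefficients[, "Std. Error"]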
Please note that these slides are intended to provide the logical framework in between active sessions with R. There are some useful visualizations among these slides, but many are just bullet points intended to bridge logic and introduce the real time sessions with R exercises. These exercises are available in the videos. (In other words, I recognize that many of these slides are terrible without the context of the R session, and these slides should definitely not be used as an example of how to develop effective visualizations for a presentation.)
Click this link to download the MS PowerPoint file
The embedded Google viewer below sometimes provides poor renderings of Microsoft files. Use the link above to download the original file with proper formatting.
Click this link to download the MS PowerPoint file
The embedded Google viewer below sometimes provides poor renderings of Microsoft files. Use the link above to download the original file with proper formatting.
The code created in the videos for a single realization of a Monte Carlo analysis of a linear regression.
The following document provides an additional reference for the materials presented in the videos above. The first two sections of this document cover the linear regression review materials in this module. The sensitivity analysis introduced in the third section of this document will be the topic of the start of the next module as a gateway to discussion of Monte Carlo thinking.
Linear regression review (link to the full page HTML version)
Linear regression review (download the fully encapsulated HTML version)
Rmarkdown source code for linear regression review (download the Rmd file)
We will be covering the details of R data structures like vectors, lists, matrices, and data frames as we need them. However, the relevant details on the R data structures we will use the most have been compiled into a single document. These notes attempt to cover the aspects of R data structures that I have seen cause the worst misconceptions or the hardest-to-find bugs in students' code.
Notes on the basic R data structures used in this class (link to the full page HTML version)
Rmarkdown source code for detailed notes on basic R data structures (download the Rmd file)
Notes on the basic R data structures used in this class (download the postscript PDF file)
Most exercises will not require sophisticated graphing skills, and the materials for this class provide examples using base R graphics. However, base R graphics provide an incredibly flexible graphing tool, and understanding just a few fundamentals gives you the capacity to tweak graphs to look exactly as you would like. The following document, which I have started (generated by Rmarkdown), begins a deeper dive into graphing with base R.
A deeper dive into graphing in base R (link to the full page HTML version)
A deeper dive into graphing in base R (download the fully encapsulated HTML version)
Rmarkdown source code for a deeper dive into graphing in base R (download the Rmd file)
A deeper dive into graphing in base R (download the postscript PDF file)