This laboratory activity starts with the example of a single linear regression based on a single realization of error in the dependent variable, and then repeats that analysis with additional realizations of error for the purpose of performing sensitivity analyses or Monte Carlo uncertainty analyses. Both of those tasks are greatly facilitated by writing some sort of loop in a computer program, providing a couple of different perspectives on the iterative logic inherent in many computer programming tasks (hence the ubiquity of the "loop" program control structure in most programming languages). The Monte Carlo analysis is also a task that might be repeated multiple times with different configurations (e.g., perhaps for a homework assignment), which means it provides a good example of the benefits of writing your own functions.
This video provides an overview of the main goal of this module based on the code developed in the last module. We would like to adapt the code to provide illustrations of how error propagates through a regression analysis, from the presumed error in the dependent variable to the statistics quantifying confidence in the inferred model parameter values (2:55 min).
Computers are excellent at completing monotonous, repeated calculations for us. The trick is knowing how to tell them what to do each time they repeat the calculation. We can introduce the concept of programmatic loops in R through our interest in performing a linear regression with several different values for the presumed error in our dependent variable (15:15 min).
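For reference, here is a minimal sketch of the kind of loop the video describes, assuming a simple synthetic data set; the variable names and values are illustrative rather than taken from the course script.

    # Loop over several presumed error levels and refit the regression each time
    x <- 1:20                           # hypothetical independent variable
    true_intercept <- 5                 # assumed "true" model parameters
    true_slope <- 2
    error_sd_values <- c(0.5, 1, 2, 4)  # presumed error standard deviations

    for (error_sd in error_sd_values) {
      # one realization of the dependent variable at this error level
      y <- true_intercept + true_slope * x + rnorm(length(x), mean = 0, sd = error_sd)
      fit <- lm(y ~ x)                  # linear regression on the synthetic data
      cat("error sd =", error_sd, "-> estimated slope =", coef(fit)["x"], "\n")
    }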
We can make the configuration of our loop more flexible by hardcoding fewer numbers. Let's write a program that knows how to run the loop as many times as necessary to examine the nature of stochastic modeling (9:07 min).
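One possible way to reduce the hardcoding, sketched below with illustrative names, is to define the configuration in a single vector at the top of the script and let the loop take its length from that vector; the empty vector collecting the standard error of the slope anticipates the summary graph at the end.

    # Flexible configuration: only the error_sd_values line needs to change
    x <- 1:20
    true_intercept <- 5
    true_slope <- 2
    error_sd_values <- seq(from = 0.5, to = 4, by = 0.5)
    slope_se <- numeric(length(error_sd_values))   # empty vector to fill in the loop

    for (i in seq_along(error_sd_values)) {
      y <- true_intercept + true_slope * x +
           rnorm(length(x), mean = 0, sd = error_sd_values[i])
      fit <- lm(y ~ x)
      # standard error of the slope at this presumed error level
      slope_se[i] <- summary(fit)$coefficients["x", "Std. Error"]
    }

    plot(error_sd_values, slope_se,
         xlab = "presumed error sd", ylab = "standard error of slope")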
The previous example is just one step away from the code needed to perform a complete Monte Carlo propagation of uncertainty on a linear regression. In fact, it is almost the same code but with a slightly different context. Instead of changing the standard deviation used to generate the synthetic error each time through the loop, we leave the standard deviation at the same value for all the realizations, so that the realizations collectively define an ensemble of results sampled from a single assumed distribution of error in the dependent variable. Summary statistics (e.g., the standard deviation) of the distribution of slope estimates from this ensemble tell us something about our uncertainty in slope, propagated from the original assumed uncertainty in the dependent variable. Likewise, the mean of the standard errors in slope from that ensemble provides a more generalized estimate of that standard error, one that is not subject to the vagaries of a single realization of the error in the dependent variable.
To implement the Monte Carlo analysis, we first need a slightly different perspective on the loop implemented in the previous example. In this case, we are going to implement graphing code within the loop to help us visualize how propagation of uncertainty via a Monte Carlo analysis works (11:52 min).
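One way to build that visualization, sketched here under the same assumed synthetic setup as the earlier examples, is to create an empty plot before the loop and add the fitted line from each realization inside the loop.

    # Draw the fitted line from every Monte Carlo realization on a single plot
    x <- 1:20
    true_intercept <- 5
    true_slope <- 2
    error_sd <- 2                       # held fixed for all realizations
    n_realizations <- 100

    plot(x, true_intercept + true_slope * x, type = "n",
         xlab = "x", ylab = "y", main = "Monte Carlo realizations")
    for (i in 1:n_realizations) {
      y <- true_intercept + true_slope * x + rnorm(length(x), mean = 0, sd = error_sd)
      fit <- lm(y ~ x)
      abline(fit, col = "gray")         # fitted line for this realization
    }
    abline(a = true_intercept, b = true_slope, col = "red", lwd = 2)  # "true" line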
If we want to calculate summary statistics from the ensemble results, we need to do more than repeat the analysis and graph the results. We need to allocate data structures (in this case vectors) that can be used with an incrementing index to collect the ensemble of values one element at a time while looping through all the realizations. This requires a strategy similar to the creation of the empty vector used to collect the values of the standard error in the slope in the sensitivity analysis above (9:54 min).
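A sketch of that collection strategy, again with illustrative names and values, might look like the following; the two summary statistics at the end correspond to the propagated uncertainty in slope and the ensemble-averaged standard error described above.

    # Preallocate vectors, fill them one element per realization, then summarize
    x <- 1:20
    true_intercept <- 5
    true_slope <- 2
    error_sd <- 2
    n_realizations <- 1000

    slope_est <- numeric(n_realizations)   # empty vectors to collect the ensemble
    slope_se  <- numeric(n_realizations)

    for (i in 1:n_realizations) {
      y <- true_intercept + true_slope * x + rnorm(length(x), mean = 0, sd = error_sd)
      fit <- lm(y ~ x)
      slope_est[i] <- coef(fit)["x"]
      slope_se[i]  <- summary(fit)$coefficients["x", "Std. Error"]
    }

    sd(slope_est)    # uncertainty in slope propagated from the error in y
    mean(slope_se)   # ensemble-averaged standard error of the slope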
After learning the iterative logic of loops, the next major rite of passage in computer programming is learning how to write your own functions. While writing functions is not strictly necessary to complete the activities in this class, being able to write a function that will complete a full Monte Carlo analysis based on arguments that allow for different configurations among those analyses may ultimately make your life easier. Let's start with the fundamental nature of functions and arguments to functions in R, including how to define them yourself (10:52 min).
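For reference, a user-defined function in R is created with the function() keyword; the sketch below uses a made-up rescaling example rather than anything specific from the video.

    # A user-defined function with two arguments, one of which has a default value
    rescale <- function(v, new_max = 1) {
      # Job:       rescale a numeric vector so its maximum equals new_max
      # Arguments: v = numeric vector; new_max = desired maximum (default 1)
      # Returns:   the rescaled numeric vector
      v * new_max / max(v)
    }

    rescale(c(2, 4, 10))                  # uses the default new_max = 1
    rescale(c(2, 4, 10), new_max = 5)     # overrides the default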
Now for a couple of simple examples of features of the Monte Carlo analysis program that can be abstracted as useful functions. When writing functions, clear documentation is critical and should always include a description of the job the function is intended to perform, a detailed list of the arguments the function needs to do its job, and a detailed description of what the function returns after it has completed its job. Let's start with a function that performs the mathematics of our linear model and learn how to properly comment this function's job (11:24 min).
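A documented version of such a function might look like the sketch below; the name and the comment layout are one possibility, not a convention prescribed by the course.

    # linear_model()
    # Job:       evaluate a linear model y = intercept + slope * x for a vector
    #            of independent variable values
    # Arguments: x         = numeric vector of independent variable values
    #            intercept = intercept of the linear model
    #            slope     = slope of the linear model
    # Returns:   numeric vector of dependent variable values, same length as x
    linear_model <- function(x, intercept, slope) {
      intercept + slope * x
    }

    linear_model(x = 1:5, intercept = 5, slope = 2)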
Another useful function might be the ability to add random noise to any vector of numbers, which is a feature we need for generating realizations of a Monte Carlo analysis (5:24 min).
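Such a function might be sketched as follows; the name is hypothetical and the details of the course version may differ.

    # add_noise()
    # Job:       add normally distributed random noise to a numeric vector
    # Arguments: v        = numeric vector to perturb
    #            noise_sd = standard deviation of the normally distributed noise
    # Returns:   numeric vector of the same length as v with noise added
    add_noise <- function(v, noise_sd) {
      v + rnorm(length(v), mean = 0, sd = noise_sd)
    }

    add_noise(c(7, 9, 11, 13, 15), noise_sd = 0.5)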
Once the logic of writing functions is clear, there is no limit to the complexity of the features a given function may provide. Let's work through a more complex example where we write a function that performs a Monte Carlo propagation of error analysis on a linear regression, where the arguments to the function allow the key characteristics of the analysis (e.g., slope, intercept, independent variable values, standard deviation of error in the dependent variable, etc.) to be changed each time the function is called (13:55 min).
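A sketch of what such a function could look like is given below; it simply wraps the collection loop from the earlier example, and the function name, argument names, and the choice to return a list are all illustrative.

    # run_monte_carlo()
    # Job:       perform a Monte Carlo propagation of error for a linear regression
    #            by generating many realizations of error in the dependent variable
    # Arguments: x              = numeric vector of independent variable values
    #            intercept      = "true" intercept used to generate the data
    #            slope          = "true" slope used to generate the data
    #            error_sd       = standard deviation of the error in y
    #            n_realizations = number of Monte Carlo realizations (default 1000)
    # Returns:   a list holding the vector of slope estimates, the standard
    #            deviation of those estimates, and the mean standard error of slope
    run_monte_carlo <- function(x, intercept, slope, error_sd, n_realizations = 1000) {
      slope_est <- numeric(n_realizations)
      slope_se  <- numeric(n_realizations)
      for (i in 1:n_realizations) {
        y <- intercept + slope * x + rnorm(length(x), mean = 0, sd = error_sd)
        fit <- lm(y ~ x)
        slope_est[i] <- coef(fit)["x"]
        slope_se[i]  <- summary(fit)$coefficients["x", "Std. Error"]
      }
      list(slope_est = slope_est,
           slope_sd = sd(slope_est),
           mean_slope_se = mean(slope_se))
    }

    result <- run_monte_carlo(x = 1:20, intercept = 5, slope = 2, error_sd = 2)
    result$slope_sd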
The following script is a working example of the sensitivity analysis code constructed in the videos.
The following script is a working example of the Monte Carlo analysis code constructed in the videos.
The following script is a working example of the Monte Carlo analysis code using the definition of functions that might be useful for the homework.
Once you are writing your own functions, you might want to store them in separate files that allow you to use the same definition of a function from several different scripts. Of course, you could just copy the function code to another script manually, but this is the basis of the "cut and paste" antipattern that creates bugs that are hard to find. Let's say you find a bug in a function that you have copied to many other scripts. Will you remember all the places where that function exists if you need to fix that bug in all of them? Ultimately, reusable R code is much easier to manage by writing R packages, but that is definitely out of the scope of these lessons. However, code in other R files can easily be loaded into a script using the source() function. Here is an example of relocating function definitions to separate files and the use of relative paths and working directories for portable code (11:32 min).
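A sketch of that workflow appears below; the file name, folder layout, and the run_monte_carlo() function it is assumed to define are all hypothetical.

    # Load function definitions stored in a separate file
    getwd()                                      # check the current working directory
    # setwd("path/to/project")                   # set it if needed

    source("functions/monte_carlo_functions.R")  # relative path from the working
                                                 # directory; runs that file and
                                                 # defines its functions here

    result <- run_monte_carlo(x = 1:20, intercept = 5, slope = 2, error_sd = 2)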
While this lab activity does not require writing R code that reads data from or writes to the file system, the concepts of relative paths and working directories open that topic for discussion. This topic is often referred to as the "file IO" features of a programming language, where "IO" stands for input-output. The simplest way for an R program to read or write data in your file system is through the functions that store data as ".RData" files. Here is a brief overview that might get you started on the basics of saving data to your file system for later use in subsequent R programs (10:31 min).
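A minimal sketch of the .RData workflow, assuming an illustrative file name, looks like this.

    # Save R objects to an .RData file and load them back later
    x <- 1:20
    y <- 5 + 2 * x + rnorm(length(x), mean = 0, sd = 2)

    save(x, y, file = "regression_data.RData")   # writes to the working directory

    rm(x, y)                                     # pretend we are in a new session
    load("regression_data.RData")                # restores objects x and y by name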
The following document provides an additional reference for the materials presented in the videos above. The first two sections of this document cover the linear regression review materials from the last module. The sensitivity analysis introduced in the third section of this document is the topic of the start of this module and serves as a gateway to the discussion of Monte Carlo thinking.
Linear regression review (link to the full page HTML version)
Linear regression review (download the fully encapsulated HTML version)
Rmarkdown source code for linear regression review (download the Rmd file)
We will be covering the details of R data structures like vectors, lists, matrices, and data frames as we need them. However, the relevant details on the R data structures we will use the most have been compiled into a single document. These notes attempt to cover the aspects of R data structures that I have seen cause the worst misconceptions or the hardest-to-find bugs in students' code.
Notes on the basic R data structures used in this class (link to the full page HTML version)
Rmarkdown source code for detailed notes on basic R data structures (download the Rmd file)
Notes on the basic R data structures used in this class (download the postscript PDF file)
Most exercises will not require sophisticated graphing skills, and the materials for this class provide examples using base R graphics. However, base R graphics provide an incredibly flexible graphing tool, and understanding just a few fundamentals gives you the capacity to tweak graphs to look exactly as you would like. The following is a document I have started (generated by Rmarkdown) that at least initiates a deeper dive into graphing with base R.
A deeper dive into graphing in base R (link to the full page HTML version)
A deeper dive into graphing in base R (download the fully encapsulated HTML version)
Rmarkdown source code for a deeper dive into graphing in base R (download the Rmd file)
A deeper dive into graphing in base R (download the postscript PDF file)