The mathematics of uncertainty propagation through a nonlinear regression is more complex than through a linear regression, and the way error propagates also varies from model to model. A brute-force probabilistic method for propagation of error that will work for any model or analysis is the Monte Carlo approach. Building on the logical scaffolding for Monte Carlo analyses established in the linear regression modules, the goal of this module is to give you the ability to perform a Monte Carlo propagation of error on a nonlinear regression.
A hyperbolic saturation model is used as an example below, but you will learn more if you challenge yourself to work through this exercise with a nonlinear model of your choice.
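For concreteness, the hyperbolic saturation model used below can be sketched as a short R function. The parameter names ymax and xhalf match the example later in this module; the numeric values here are illustrative assumptions, not values from the course script:

```r
# Hyperbolic saturation model (Michaelis-Menten form):
#   y = ymax * x / (xhalf + x)
# ymax:  asymptotic maximum of y as x grows large
# xhalf: value of x at which y reaches half of ymax
hypSat <- function(x, ymax, xhalf) ymax * x / (xhalf + x)

x <- seq(0, 20, by = 0.5)
y <- hypSat(x, ymax = 10, xhalf = 3)
plot(x, y, type = "l", ylab = "y", main = "Hyperbolic saturation")
```

Note that at x = xhalf the function returns exactly ymax / 2, which is a handy sanity check for any implementation of this model.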
If you worked along with the videos in the previous module, then you should have code similar to the following script. You can either start with your own code or copy and paste the following script if you want to follow along with the coding activities in this module.
Note that the nlsSummary object created by this script does have a coefficients table similar to that of lm(), where a standard error column reports a metric of uncertainty for each parameter estimated by the nonlinear regression algorithm. This statistic is not generally reliable for nonlinear regressions: it is only appropriate when the error in the dependent variable propagates linearly through the model, which is true for only a very small subset of nonlinear models. The Monte Carlo propagation technique that we practiced with linear regressions is a more robust and general technique for propagating error through nonlinear regression analyses. As with the linear regression exercise, adding a Monte Carlo analysis involves putting the analysis of a single realization of the error into a loop (9:53 min).
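A single realization of that loop body might look like the following sketch. The baseline parameter values, the error standard deviation, and the starting guesses are assumptions for illustration; your own script's values will differ:

```r
# One realization: add artificial error, fit with nls(), inspect the summary
set.seed(1)
x     <- seq(0.5, 20, by = 0.5)
yTrue <- 10 * x / (3 + x)                         # assumed "true" curve
yObs  <- yTrue + rnorm(length(x), mean = 0, sd = 0.5)  # artificial error

fit <- nls(yObs ~ ymax * x / (xhalf + x),
           start = list(ymax = 8, xhalf = 2))     # rough starting guesses
nlsSummary <- summary(fit)
nlsSummary$coefficients   # estimates plus the (often unreliable) std. errors
```

The coefficients table gives point estimates for ymax and xhalf; the Monte Carlo loop described in the video repeats this realization many times instead of trusting the reported standard errors.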
A Monte Carlo analysis does not do much good if the code doesn't remember something about the results for each realization. In this case, we will eventually want to track several statistics from each realization. This need for multidimensional data gives us an opportunity to introduce the data frame, another data structure in R that will be useful for tracking the results of each realization as a row of data in a two-dimensional data structure.
If you recall, a matrix is a two-dimensional data structure in which every value must be of the same data type, because a matrix is an extension of an atomic vector. A data frame is also a two-dimensional tabular data structure, but a data frame is built from a list of vectors. Each column of the table can therefore have its own atomic type, while all columns share the same number of rows. Let's review the fundamentals of the R data frame object and the implications of its implementation as a list of vectors (10:12 min).
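A minimal sketch of that list-of-vectors idea (the column names here are arbitrary examples):

```r
# A data frame is a list of equal-length vectors; columns may differ in type
df <- data.frame(id   = 1:3,
                 name = c("a", "b", "c"),
                 ok   = c(TRUE, FALSE, TRUE))

typeof(df)         # "list"       -- the underlying storage
class(df)          # "data.frame" -- the behavior layered on top
sapply(df, class)  # each column keeps its own atomic type
```

Contrast this with `cbind(1:3, c("a", "b", "c"))`, which produces a matrix and silently coerces every value to character.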
Because a data frame ultimately inherits the properties of a list, the columns of a data frame can be indexed in the same way a list can be indexed. However, because a data frame is also a two-dimensional data structure, R also allows you to index it the same way as a matrix. Let's explore the indexing of data frames and some of the creative applications of indexing that allow you to parse or sort the rows with a single line of code (13:03 min).
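The dual list-style and matrix-style indexing can be sketched as follows, including the one-line row parsing and sorting mentioned above (the example data are invented):

```r
df <- data.frame(x = c(5, 2, 9, 4), grp = c("a", "b", "a", "b"))

df$x                  # list-style: extract the x column as a vector
df[["grp"]]           # also list-style, useful when the name is in a variable
df[2, "x"]            # matrix-style: [row, column]

df[df$grp == "a", ]   # parse: keep only rows where grp is "a"
df[order(df$x), ]     # sort the rows by increasing x in a single line
```

In both of the last two lines, the empty index after the comma means "keep all columns," so the result is still a data frame.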
Adding new rows or columns to data frames is relatively straightforward. However, for data frames that have more than a few thousand records, adding more rows to a data frame one at a time can be particularly inefficient in terms of both RAM and processor time. Data frames are not designed to be efficient dynamic data structures. As described in previous modules, it is best practice in computer programming to pre-allocate memory if you know the space that will be needed before the algorithm that populates the memory is executed. Let's review how to add columns or rows to data frames and how to pre-allocate the memory used by data frames for population by subsequent algorithms (12:26 min).
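A sketch of the pre-allocation pattern, with an assumed realization count:

```r
nReal <- 1000  # number of Monte Carlo realizations (assumed for illustration)

# Pre-allocate: one row per realization, typed columns of the right length
results <- data.frame(ymax      = numeric(nReal),
                      xhalf     = numeric(nReal),
                      converged = logical(nReal))

# Growing row-by-row copies the entire frame on every iteration -- avoid:
#   results <- rbind(results, newRow)

# Instead, fill the pre-allocated rows in place:
results[1, ] <- list(9.8, 3.1, TRUE)
```

Because the memory is reserved once up front, filling row i inside a loop is cheap no matter how large nReal becomes.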
Armed with the ability to pre-allocate a data frame with enough rows to store results from each realization, we can now write code to collect an ensemble of results from a Monte Carlo analysis of a nonlinear regression (10:58 min).
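Putting the pieces together, the ensemble collection might be sketched like this. The baseline values (ymax = 10, xhalf = 3), error standard deviation, and realization count are assumptions for illustration:

```r
set.seed(42)
x     <- seq(0.5, 20, by = 0.5)
yTrue <- 10 * x / (3 + x)   # assumed baseline: ymax = 10, xhalf = 3
sdErr <- 0.5                # assumed standard deviation of artificial error
nReal <- 200

# Pre-allocate one row per realization
results <- data.frame(ymax = numeric(nReal), xhalf = numeric(nReal))

for (i in 1:nReal) {
  yObs <- yTrue + rnorm(length(x), sd = sdErr)   # new error realization
  fit  <- nls(yObs ~ ymax * x / (xhalf + x),
              start = list(ymax = 8, xhalf = 2))
  results[i, ] <- coef(fit)                      # record this realization
}
```

After the loop, each row of results holds the parameter estimates recovered from one realization of the artificial error.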
With the ensemble of results recorded in a data frame, we can subsequently look at the probability distributions and quantiles that summarize the uncertainty in the estimated parameter values recorded in columns of that data frame (12:35 min).
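For example, the ensemble columns can be summarized with a histogram and quantiles. The results frame here is simulated so the snippet is self-contained; in practice it would come from the Monte Carlo loop:

```r
# Stand-in for the Monte Carlo ensemble (simulated for this sketch)
set.seed(7)
results <- data.frame(ymax  = rnorm(1000, mean = 10, sd = 0.3),
                      xhalf = rnorm(1000, mean = 3,  sd = 0.2))

hist(results$ymax, main = "ymax ensemble", xlab = "ymax")
quantile(results$ymax, probs = c(0.025, 0.5, 0.975))  # median and 95% interval
```

The 2.5% and 97.5% quantiles bracket a 95% uncertainty interval for the parameter, which is usually the headline result of the propagation.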
If you wanted to be able to easily repeat the Monte Carlo propagation of error for a given nonlinear regression analysis (e.g., for completing a homework assignment asking you for a sensitivity analysis), you might want to write a function that will allow you to repeat the analysis with a single line of code, but also allow you to change the baseline parameter values for the model and assumed standard deviation in the error. Let's review how to abstract the analysis into a function allowing you to change those properties as arguments to the function, similar to the exercise of defining functions for the linear regression Monte Carlo analysis (32:02 min).
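One possible shape for such a function is sketched below. The function name, argument names, and defaults are hypothetical; the video develops the version used in this class:

```r
# Hypothetical wrapper: baseline parameters and error sd become arguments
mcHypSat <- function(ymax = 10, xhalf = 3, sdErr = 0.5,
                     x = seq(0.5, 20, by = 0.5), nReal = 100) {
  results <- data.frame(ymax = numeric(nReal), xhalf = numeric(nReal))
  yTrue   <- ymax * x / (xhalf + x)
  for (i in 1:nReal) {
    yObs <- yTrue + rnorm(length(x), sd = sdErr)
    fit  <- nls(yObs ~ ymax * x / (xhalf + x),
                start = list(ymax = ymax, xhalf = xhalf))
    results[i, ] <- coef(fit)
  }
  results
}

# Repeat the whole analysis with one line, varying the assumed error:
set.seed(1)
out <- mcHypSat(sdErr = 0.25, nReal = 50)
```

Passing the baseline parameter values in as both the truth and the nls() starting guesses keeps the call simple, at the cost of giving the optimizer an unrealistically good start.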
Here is an example script illustrating the use of this function to perform a sensitivity analysis. In this case we might be interested in how uncertainty in the dependent variable values propagates to uncertainty in the inferred values of the xhalf and ymax model parameters.
NOTE: Setting the standard deviation of the artificial error too high can cause nls() to fail to converge, which will crash this script. The maximum standard deviation of 2 used here is on the edge of what might cause this problem, so this code may occasionally crash. I leave the code in this state to provide an example of the problems you might run into with your own models. Keeping the maximum standard deviation at 1 or less should result in more reliable completion of this script.
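If you would rather record a failed realization than crash, one common pattern is to wrap the nls() call in tryCatch(). This is a sketch of that defensive pattern, not the course script; the helper name and NA convention are assumptions:

```r
# Return fitted coefficients, or NAs if nls() fails to converge
fitOrNA <- function(yObs, x) {
  tryCatch(coef(nls(yObs ~ ymax * x / (xhalf + x),
                    start = list(ymax = 8, xhalf = 2))),
           error = function(e) c(ymax = NA, xhalf = NA))
}

set.seed(3)
x    <- seq(0.5, 20, by = 0.5)
yObs <- 10 * x / (3 + x) + rnorm(length(x), sd = 2)  # large error: risky fit
fitOrNA(yObs, x)   # either the estimates or a pair of NAs, never a crash
```

Inside a Monte Carlo loop, the NA rows can be dropped afterward with na.omit(), though you should report how many realizations failed, since frequent failures signal that the assumed error is too large for the model.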
Example of a script that performs a Monte Carlo propagation of error through a nonlinear regression with a hyperbolic saturation model
Example of a script that defines a function that performs a Monte Carlo propagation of error through a nonlinear regression with a hyperbolic saturation model
We will cover the details of R data structures like vectors, lists, matrices, and data frames as we need them. However, the relevant details on the R data structures we will use the most have been compiled into a single document. These notes attempt to cover the aspects of R data structures that I have seen cause the worst misconceptions or the hardest-to-find bugs in students' code.
Notes on the basic R data structures used in this class (link to the full page HTML version)
Rmarkdown source code for detailed notes on basic R data structures (download the Rmd file)
Notes on the basic R data structures used in this class (download the postscript PDF file)
Most exercises will not require sophisticated graphing skills, and the materials for this class provide examples using base R graphics. However, base R graphics are an incredibly flexible graphing tool, and understanding just a few fundamentals gives you the capacity to tweak graphs to look exactly as you would like. The following is a document I have started (generated with Rmarkdown) that initiates a deeper dive into graphing with base R.
A deeper dive into graphing in base R (link to the full page HTML version)
A deeper dive into graphing in base R (download the fully encapsulated HTML version)
Rmarkdown source code for a deeper dive into graphing in base R (download the Rmd file)
A deeper dive into graphing in base R (download the postscript PDF file)