Mathematical Sensitivity Analysis
This module is under active development
Before diving into more detailed materials on coding, we should address the role of large language models (LLMs) in computer programming. First, I do not like to use the term "intelligence" for what LLMs do, artificial or otherwise. I think of intelligence as an understanding of causality that goes beyond purely empirical pattern matching in a language based on training from a large collection of past narratives. Second, use of large language models will undoubtedly help you learn programming skills faster and will also allow you to apply your programming skills to generate programs in less time. While this class does not currently focus on LLM use, I will be working to add exercises that illustrate it as I find effective examples.
However, when it comes to getting a computer to do what you want for more complex tasks like data analysis, I also think that those who have a fundamental understanding of how computers and computer programs work will always have an advantage over those who don't, regardless of whether LLMs are in use and whether a human is actually writing the code. Of course, this statement is somewhat dependent on the use case of the program being generated. But on average across use cases for more complex programs, I think it is probably true. A software engineer's job involves many more things than just typing code into a text editor, and LLMs make that job much easier but do not replace the need to understand the fundamentals of programming. The video below provides the perspective of a practicing software engineer. Note the video has the tiniest bit of (perhaps justifiable) explicit language. Alberta Tech has made both instructive and satirical videos about the use of large language models and "vibe coding" that provide an entertaining perspective from someone who actually understands how software development is done, which I think is much more useful than the exaggerations being propagated by CEOs trying to sell their product to large companies and investors (8:27 min).
Regardless of how code is generated, its integrity is paramount. If you generate a program with an LLM, you should either have the programming experience to review the code yourself or have the tool extensively comment the code so you can perform a pseudo-code review without being directly familiar with the programming language. Alternatively, or in addition, you should take the extra step of having the LLM generate incremental testing scenarios that ensure the code really does what you intended when entering the prompt. When a human writes code, especially in a scripting language, the process usually inherently includes at least some incremental testing to be sure elements of the program perform as expected before they are trusted. This process is highly formalized by professional software engineers with formal unit and system tests. It doesn't happen when an LLM generates code, unless you force it to show you a series of tests in the sequence of prompts you use to generate a program. Using an LLM does not absolve you of the responsibility to ensure the integrity of your analysis. Regardless of the tools used, good programmers are not those who are free from mistakes; good programmers are those who catch their mistakes the earliest.
This module will use a mathematical sensitivity analysis as an analytical data product case study for learning some of the basics of data structures, algorithms, graphics engines, and data flow pipelines in programming. Before we get started, I want to emphasize that learning fundamental programming skills and learning a given computer language are not the same thing. Yes, you need to pick an initial language to get any practice on the way to learning fundamental programming skills. But if you take the time to learn fundamental programming concepts in addition to the syntax of that language for accomplishing specific goals, this understanding will transfer to any additional language you wish to learn in the future and allow you to learn that language much faster. We will be using the scripting language R for exercises mostly because it is very easy to get started in R and get to the point where you are writing meaningful programs quickly. But we will be doing so in the spirit of learning the fundamentals of programming that will transfer to a much more generally useful scripting language for software development like python. The following video is a nice summary of the considerations in using R or python for different use cases. I think it ultimately suggests that if you want to build a career around automated data analysis skills, you should probably have good fundamental programming skills and learn how to understand programs in both languages. After all, you can write an R program that translates R data structures into python data structures and runs a python program for part of its function, and vice versa (7:06 min).
Be able to describe the basics of how a microprocessor-based system works and the types of programming languages used to control it.
Understanding the details of how a computer works is not necessary for programming. However, understanding the basics of how your computer's memory works helps reveal why data typing is such an important topic early in computer programming training. Also, understanding how the lowest-level machine language programs are executed by the processor's logic circuitry helps reveal the differences in efficiency between the customized machine language programs from compiled languages and the higher-level scripting language programs that are ultimately executed on the processor by a precompiled interpreter program (3:56 min).
With a basic understanding of how the memory works, we can review the fundamentals of how the processor runs a program and in particular an individual instruction of machine language (5:09 min).
The lowest level human-readable language that can be used to generate the machine language instructions that run directly on the processor is called assembly language. Assembly language is specific to the instruction set for a given processor and is essentially just a one-to-one translation between the series of bits representing a given instruction and a narrative word or phrase for that instruction. Assembly code is rarely written directly by humans except for educational exercises. If customized machine language is needed, it is generally created from a compiled programming language like C or Fortran. For automating workflows where memory use or processor efficiency is less critical or can be handled by external programs, programming in scripting languages that are executed by precompiled interpreters like the R or python engines is popular. The following video summarizes where compiled or scripting languages may be more appropriate and describes why R is used for the activities here (5:48 min).
The above video groups python in with scripting languages, which is not altogether untrue but is perhaps an oversimplification. Whether a program is scripted or compiled is really more about the mode in which it is run on the processor than necessarily about the language used to write the program. Compilers are available that are able to translate python code into a customized machine language program, either for execution on a python virtual machine or even for native execution on the processor. Perhaps it is better to think of programming as existing on a continuum between low and high level modes of operation, where the lowest level of human-readable program is assembly language that directly reflects the processor's instruction set. From there, progressively higher levels of programs insert more and more layers of software needed to either compile the program to machine language or execute the program with a precompiled interpreter. Regardless of the level of program or language used, computing always comes down to execution of machine language on the processor, and the efficiency of a program's execution is going to depend on the appropriate matching of the program's use case with the computation technology applied (if the task is computationally intensive enough for efficiency to matter).
Unless you are using packages that use compiled C programs optimized for particular data wrangling tasks (e.g. dplyr), any computations carried out directly with R operations and program control structures are generally quite inefficient relative to the same algorithm running directly on your processor. However, on a modern computer, you have to be dealing with pretty large data sets before you will be able to tell the difference without a benchmarking timer. While these learning materials are not strongly focused on optimized code, we will occasionally make note of programming practices that generally allow programs to run faster. In general, the more your R code is using C or C++ programs to manipulate native C or C++ data structures (e.g. using R interface packages like dplyr), the faster your R program will run.
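As a rough illustration of the efficiency gap described above, the following sketch compares an explicit R loop to the equivalent vectorized operation, which dispatches the work to compiled C code inside R. The variable names and the size of the vector are arbitrary choices for the demonstration; exact timings will vary by machine.

```r
# Compare an explicit R loop to a vectorized operation.
# Both compute the same result, but the vectorized form runs
# in compiled C code rather than interpreted R control structures.
n <- 1e6
x <- runif(n)

loop_time <- system.time({
  y1 <- numeric(n)
  for (i in seq_len(n)) {
    y1[i] <- x[i] * 2
  }
})

vec_time <- system.time({
  y2 <- x * 2
})

identical(y1, y2)   # TRUE: same result either way
loop_time["elapsed"]
vec_time["elapsed"]  # the loop is typically far slower
```

Running this with a benchmarking timer like `system.time()` is exactly the kind of check the paragraph above alludes to: for small vectors the difference is imperceptible, but it grows with the size of the data.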
Slides from videos
The original slides used in these videos are available below.
Click this link to download the Microsoft PowerPoint file
Note that the Google Slides preview window below provides pictures of the slides that do not include the animations in the original file. Please download the original file from the link above if you would like to view the slides with all animations in Microsoft PowerPoint.
The remainder of this module assumes that you have R and RStudio installed. You will need to have R installed first, so the RStudio installation can find the current R installation and automatically create the necessary links to it. The links to the R installation used by RStudio can easily be changed in RStudio's Global Options under the Tools menu.
Here are the direct links for installing R and RStudio, in the order in which they should be installed:
The Comprehensive R Archive Network (CRAN) for downloading and installing R (among many other R packages).
Open source RStudio Desktop for downloading and installing RStudio (should install R first).
This module also assumes that plentiful public resources are available for helping with the installation of these software packages on your operating system of choice, and does not provide specific narrative or videos on this topic.
Many programming classes are focused on programming skills alone and do not necessarily provide training on how to organize the files associated with running the programs to achieve a task. While the details of a folder structure are highly dependent on the use case driving development of the program, the majority of environmental data processing and analysis tasks might be generalized as a pipeline for data flow through four primary categories of files. Therefore, an abstract folder structure for organizing the files defining a given segment of environmental analysis workflow might be composed of a root folder with the following four subdirectories: 1_input, 2_protocol, 3_incremental, and 4_product. The folder names have a numerical prefix to ensure that the default alphabetical listing typical for file system exploration tools will result in a logical order corresponding to data flow. Simply categorizing workflow files based on their presence in these folders provides inherent metadata toward the ability of others to understand the flow of data through the pipeline.
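The four-folder skeleton described above can also be created programmatically rather than by hand. A minimal sketch in R follows; the root location under `tempdir()` is just a placeholder for wherever you keep your projects.

```r
# Create the root directory of a pipeline and its four numbered
# subdirectories. The numeric prefixes keep default alphabetical
# file listings in an order that matches the flow of data.
root    <- file.path(tempdir(), "my_pipeline")  # example location
folders <- c("1_input", "2_protocol", "3_incremental", "4_product")

for (f in file.path(root, folders)) {
  dir.create(f, recursive = TRUE, showWarnings = FALSE)
}

list.files(root)
# "1_input" "2_protocol" "3_incremental" "4_product"
```

Scripting the folder creation has the side benefit that the structure itself becomes reproducible: anyone running the script gets an identical skeleton.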
Descriptions of the intended content of these folders are provided below in simple HTML format. You might consider routinely including an HTML file similar to this in the root directory of the pipeline to remind you of the purpose of organizing files into these directories to clarify data flow. A modified version of this HTML file may be useful to include in a shared data product to describe the folder structure to a peer attempting to use it, though the bulk of the detailed metadata and description of folder contents should be located in the product directory. These learning materials do not include training on the HTML text-based language for rendering formatted content in web browsers, but this code provides an example of the HTML tags used to designate simple paragraphs and unnumbered lists in the body of an HTML document. Comparing this HTML code with markdown code from previous exercises provides an example of differences in syntax among formatting languages to accomplish similar goals like formatting paragraphs and lists. Understanding that these formatting languages have many common goals despite different syntaxes is a key to learning new formatting languages quickly and translating between them.
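The original HTML file is not reproduced here, but a minimal example of the tags mentioned above (`<p>` for paragraphs and `<ul>`/`<li>` for unnumbered lists) might look like the following. The folder descriptions are paraphrased from this module, not the original file's wording.

```html
<!DOCTYPE html>
<html>
  <body>
    <p>This pipeline is organized into four numbered folders that
    reflect the flow of data from raw inputs to the final product.</p>
    <ul>
      <li>1_input: raw data files as received, never modified</li>
      <li>2_protocol: the code and project files that process the data</li>
      <li>3_incremental: intermediate files generated during processing</li>
      <li>4_product: the final data product and its metadata</li>
    </ul>
  </body>
</html>
```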
Depending on the default configuration of the web browser, this HTML code will render to the following formatted text.
With this strategy in mind, let's work through building the basic structure of a pipeline for our sensitivity analysis project and creating an RStudio project file in the protocol directory that will help us stay organized when working on the R code performing the workflow. Consistent practice of using a project file to open RStudio for work on a given pipeline will eventually facilitate the use of relative paths in code that will enhance the portability and reproducibility of the data product (i.e., the ability to move the entire folder structure to another computer and have it still work). Let's start by creating a root directory for a data product and adding the pipeline folders to the root (4:18 min).
Once the pipeline folders are in place, we can create an RStudio project file in the protocol folder that will make it easier to bring up a unique RStudio session for work on the R code that generates the data product. This is the first step toward working with the data product in a way that will ultimately make it more portable and reproducible (4:55 min).
Before we start programming, this is a good time to review the more commonly used tools in the RStudio integrated development environment in more detail, including thinking about the different parts of your computer's software stack with which you are interacting when using those tools (18:22 min).
Note this video is a little out of date in that the file system on a current Apple computer is likely the APFS file system rather than HFS+.
Slides from videos
The original slides used in these videos are available below.
Click this link to download the Microsoft PowerPoint file
Note that the Google Slides preview window below provides pictures of the slides that do not include the animations in the original file. Please download the original file from the link above if you would like to view the slides with all animations in Microsoft PowerPoint.
Retention of material while learning how to program computers is aided by active learning exercises that have a meaningful goal for what we want the computer to do. For this exercise, we are going to learn the programming skills necessary to create an analytical data product that provides a sensitivity analysis of a commonly used mathematical model. We are going to write a program that performs the calculations of the model with several different parameter values, then creates a visualization that allows us to assess how altering that parameter value influences the behavior of the model. This is an extremely common exercise in scientific analysis when we are trying to determine whether a given mathematical model will be appropriate for describing patterns in our data. Let's review the theoretical origins of the exponential decay model, which is prolifically used across scientific disciplines to describe many different source-limited behaviors in nature. In this case we'll think about it in the context of first-order kinetics from the rate theory of chemistry (6:49 min).
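The derivation covered in the video follows the standard first-order kinetics argument, which can be summarized in symbols (with $C$ the reactant concentration, $C_0$ its initial value at $t = 0$, and $k$ the first-order rate constant):

```latex
% First-order kinetics: the rate of loss is proportional to C itself
\frac{dC}{dt} = -kC
% Separate variables and integrate from C_0 at t = 0 to C(t) at time t
\int_{C_0}^{C(t)} \frac{dC'}{C'} = -k \int_0^{t} dt'
% which gives \ln\!\left(C(t)/C_0\right) = -kt, i.e. the exponential decay model
C(t) = C_0 \, e^{-kt}
```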
To get R to do calculations with this model that will help us understand how it predicts concentration will change over time, we need a way to tell R to create a series of times at which we want to know the concentration. Therefore, the first step in writing R code is to understand how to assign values to variables in the global environment, and how to query or manipulate the variables contained within the global environment (7:53 min).
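A minimal sketch of assignment and environment queries in R follows; the variable names and values are arbitrary examples, not the ones used in the video.

```r
# Assign values to variables in the global environment.
# Even a single number is stored as a one-element atomic vector.
c0 <- 100    # example initial concentration
k  <- 0.5    # example rate constant

# Query the global environment.
ls()             # lists variable names currently defined, e.g. "c0" "k"
exists("c0")     # TRUE

# Remove a variable from the environment.
rm(k)
exists("k")      # FALSE
```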
The most fundamental and irreducible data type in R is the atomic vector. Even when we assign a single value to a variable, R is still creating an atomic vector to hold that value; but that vector is composed of a single element containing the assigned value. Let's briefly review a few of the ways to create vectors with more than one element, with careful attention to the data type being assumed by R in allocating space in RAM for the value (12:04 min).
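A few common ways to create a vector, and the storage types R chooses for them, can be sketched as follows (the variable names are arbitrary):

```r
# Three common ways to create a vector of numbers:
times_c   <- c(0, 1, 2, 3, 4)               # combine explicit values
times_seq <- seq(from = 0, to = 4, by = 1)  # regular sequence
times_col <- 0:4                            # colon shorthand for a sequence

# R chooses a storage type (and RAM layout) for each vector:
typeof(times_c)    # "double"  -- c() with numeric literals stores doubles
typeof(times_col)  # "integer" -- the : operator produces integers
length(times_seq)  # 5
```

The distinction between "double" and "integer" storage rarely matters for small scripts, but it is exactly the kind of data typing detail that the memory discussion earlier in this module was preparing you to understand.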
Now that you are familiar with creating numeric vectors, we can start a script that will execute a sensitivity analysis of exponential decay in the context of first-order kinetics (14:19 min).
Now that we have a vector of times, we need to do some math with it to calculate concentrations. R needs to have special rules for doing mathematical operations with variables representing numeric vectors because each vector in the equation may have a different number of elements. The following video shows examples of how R will recycle the values in shorter vectors to match the number of elements in the longer vector when doing calculations (6:00 min).
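The recycling behavior described above can be sketched in a few lines (the values here are arbitrary examples):

```r
# When vectors of different lengths meet in arithmetic,
# R recycles the shorter vector to match the longer one.
long  <- c(1, 2, 3, 4, 5, 6)
short <- c(10, 20)

long + short
# short recycles as 10 20 10 20 10 20, giving: 11 22 13 24 15 26

# A single value (a one-element vector) recycles across everything:
long * 2
# 2 4 6 8 10 12

# If the longer length is not a multiple of the shorter one,
# R still computes the result but issues a warning:
# c(1, 2, 3) + c(10, 20)  ->  11 22 13, with a warning
```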
Understanding the recycling of values of shorter vectors in R calculations is critical to understanding how the exponential decay equation results in a vector of values the same as the length of the time vector (9:54 min).
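Here is a sketch of how recycling makes the decay equation work over a whole time vector; the parameter values are arbitrary examples, not the ones used in the video.

```r
c0    <- 100                    # initial concentration (example value)
k     <- 0.5                    # rate constant (example value)
times <- seq(0, 10, by = 0.5)   # 21 times at which to evaluate the model

# c0 and k are one-element vectors; R recycles each of them across
# all elements of times, so conc has the same length as times.
conc <- c0 * exp(-k * times)

length(times)   # 21
length(conc)    # 21
conc[1]         # 100, since exp(0) = 1
```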
When a compound data structure you are using has more than one element, the ability to extract specific elements from those data structures for calculations or other data processing is often necessary. R has a wide variety of different ways you can index a vector to create a new vector with only specific elements of the original (11:59 min).
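The main indexing styles can be sketched with a small example vector (the values and names are arbitrary):

```r
x <- c(5, 10, 15, 20, 25)

# Positional indexing: pick elements by position.
x[c(1, 3)]      # 5 15

# Negative indexing: drop elements by position.
x[-1]           # 10 15 20 25

# Logical indexing: keep elements where the condition is TRUE.
x[x > 12]       # 15 20 25

# Name indexing, if the vector has names:
names(x) <- c("a", "b", "c", "d", "e")
x["b"]          # 10 (carrying its name "b")
```

Logical indexing is the style we will lean on most in the filtering step that follows, because the condition can be computed from the data itself.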
Let's work through a simple example of the application of vector indexing by using it to filter the data we consider in the exponential decay sensitivity analysis (5:41 min).
Note that NA values are generally ignored by R's graphing functions. Try plotting the results of the above code to see how the graph is now truncated to the first 99% of reactant consumption.
After the videos above, you should have an R script in the protocol folder of the pipeline that looks something like the following. Note that I have added more thorough comments than those typed in the video.
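The original script is not reproduced here, but a hedged sketch of what such a sensitivity-analysis script might contain, assembled from the steps in this module, follows. The variable names and parameter values are illustrative assumptions, not the original code from the videos.

```r
# Sensitivity analysis of the exponential decay model
#   C(t) = c0 * exp(-k * t)
# varying the first-order rate constant k.

# ---- Parameters (example values) ----
c0    <- 100                    # initial concentration
ks    <- c(0.1, 0.5, 1.0)       # rate constants to compare
times <- seq(0, 20, by = 0.1)   # times at which to evaluate the model

# ---- Model calculations ----
# One concentration vector per value of k; recycling of c0 and k
# makes each result the same length as the times vector.
conc_list <- lapply(ks, function(k) c0 * exp(-k * times))

# ---- Filter to the first 99% of reactant consumption ----
# Replace concentrations below 1% of c0 with NA; graphing
# functions ignore NA values, truncating each curve.
conc_list <- lapply(conc_list, function(conc) {
  conc[conc < 0.01 * c0] <- NA
  conc
})

# ---- Visualization ----
plot(times, conc_list[[1]], type = "l", col = 1,
     xlab = "Time", ylab = "Concentration",
     main = "Sensitivity of exponential decay to k")
for (i in 2:length(ks)) {
  lines(times, conc_list[[i]], col = i)
}
legend("topright", legend = paste("k =", ks),
       col = seq_along(ks), lty = 1)
```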
Slides from videos
Exponential decay derivation
The original slides used in these videos are available below.
Click this link to download the Microsoft PowerPoint file
Note that the Google Slides preview window below provides pictures of the slides that do not include the animations in the original file. Please download the original file from the link above if you would like to view the slides with all animations in Microsoft PowerPoint.
R introductory material
The original slides used in these videos are available below.
Click this link to download the Microsoft PowerPoint file
Note that the Google Slides preview window below provides pictures of the slides that do not include the animations in the original file. Please download the original file from the link above if you would like to view the slides with all animations in Microsoft PowerPoint.
Slides from videos
The original slides used in these videos are available below.
Click this link to download the Microsoft PowerPoint file
Note that the Google Slides preview window below provides pictures of the slides that do not include the animations in the original file. Please download the original file from the link above if you would like to view the slides with all animations in Microsoft PowerPoint.
A general tutorial on R data structures
A more general tutorial on R data structures is available in the general resources for the class.
Link to a full page HTML version of the markdown tutorial in general resources