Mathematical Sensitivity Analysis
This module is under active development
Before diving into more detailed materials on coding, we should address the role of large language models (LLMs) in computer programming. First, I do not like to use the term "intelligence" for what LLMs do, artificial or otherwise. I think of intelligence as an understanding of causality that goes beyond purely empirical pattern matching in a language based on training from a large collection of past narratives. Second, use of large language models will undoubtedly help you learn programming skills faster and will also allow you to apply your programming skills to generate programs in less time. While this class does not currently focus on LLM use, I will be working to add exercises that illustrate it as I find effective examples.
However, when it comes to getting a computer to do what you want for more complex tasks like data analysis, I also think that those who have a fundamental understanding of how computers and computer programs work will always have an advantage over those who don't, regardless of whether LLMs are in use and whether a human is actually writing the code. Of course, this statement is somewhat dependent on the use case of the program being generated. But on average across use cases for more complex programs, I think it is probably true. A software engineer's job involves many more things than just typing code into a text editor, and LLMs make that job much easier but do not replace the need to understand the fundamentals of programming. The video below provides the perspective of a practicing software engineer. Note the video has the tiniest bit of (perhaps justifiable) explicit language. Alberta Tech has made both instructive and satirical videos about the use of large language models and "vibe coding" that provide an entertaining perspective from someone who actually understands how software development is done, which I think is much more useful than the exaggerations being propagated by CEOs trying to sell their product to large companies and investors (8:27 min).
Regardless of how code is generated, its integrity is paramount. If you generate a program with an LLM, you should either have the programming experience to review the code yourself or have the tool extensively comment the code so you can perform a pseudo-code review without being directly familiar with the programming language. Alternatively, or in addition, you should take the extra step of having the LLM generate incremental testing scenarios that ensure the code really does what you intended when entering the prompt. When a human writes code, especially in a scripting language, the process usually inherently includes at least some incremental testing to be sure elements of the program perform as expected before they are trusted. This process is highly formalized by professional software engineers with formal unit and system tests. It doesn't happen when an LLM generates code, unless you force it to show you a series of tests in the sequence of prompts you use to generate a program. Using an LLM does not absolve you of the responsibility to ensure the integrity of your analysis. Regardless of the tools used, good programmers are not those who are free from mistakes; good programmers are those who catch their mistakes the earliest.
This module will use a mathematical sensitivity analysis as an analytical data product case study for learning some of the basics of data structures, algorithms, graphics engines, and data flow pipelines in programming. Before we get started, I want to emphasize that learning fundamental programming skills and learning a given computer language are not the same thing. Yes, you need to pick an initial language to get any practice on the way to learning fundamental programming skills. But if you take the time to learn fundamental programming concepts in addition to the syntax of that language for accomplishing specific goals, this understanding will transfer to any additional language you wish to learn in the future and allow you to learn that language much faster. We will be using the scripting language R for exercises mostly because it is very easy to get started in R and get to the point where you are writing meaningful programs quickly. But we will be doing so in the spirit of learning the fundamentals of programming that will transfer to a much more generally useful scripting language for software development like python. The following video is a nice summary of the considerations in using R or python for different use cases. I think it ultimately suggests that if you want to build a career around automated data analysis skills, you should probably have good fundamental programming skills and learn how to understand programs in both languages. After all, you can write an R program that translates R data structures into python data structures and runs a python program for part of its function, and vice versa (7:06 min).
Be able to describe the basics of how a microprocessor-based system works and the types of programming languages used to control it.
Understanding the details of how a computer works is not necessary for programming. However, understanding the basics of how your computer's memory works helps reveal why data typing is such an important topic early in computer programming training. Also, understanding how the lowest-level machine language programs are executed by the processor's logic circuitry helps reveal the differences in efficiency between the customized machine language programs from compiled languages and the higher-level scripting language programs that are ultimately executed on the processor by a precompiled interpreter program (3:56 min).
With a basic understanding of how the memory works, we can review the fundamentals of how the processor runs a program and in particular an individual instruction of machine language (5:09 min).
The lowest level human-readable language that can be used to generate the machine language instructions that run directly on the processor is called assembly language. Assembly language is specific to the instruction set for a given processor and is essentially just a one-to-one translation between the series of bits representing a given instruction and a narrative word or phrase for that instruction. Assembly code is rarely written directly by humans except for educational exercises. If customized machine language is needed, it is generally created from a compiled programming language like C or Fortran. For automating workflows where memory use or processor efficiency is less critical or can be handled by external programs, programming in scripting languages that are executed by precompiled interpreters like the R or python engines is popular. The following video summarizes where compiled or scripting languages may be more appropriate and describes why R is used for the activities here (5:48 min).
The above video groups python in with scripting languages, which is not altogether untrue but is perhaps an oversimplification. Whether a program is scripted or compiled is really more about the mode in which it is run on the processor than necessarily about the language used to write the program. Compilers are available that are able to translate python code into a customized machine language program, either for execution on a python virtual machine or even for native execution on the processor. Perhaps it is better to think of programming as existing on a continuum between low and high level modes of operation, where the lowest level of human-readable program is assembly language that directly reflects the processor's instruction set. From there, progressively higher levels of programs insert more and more layers of software needed to either compile the program to machine language or execute the program with a precompiled interpreter. Regardless of the level of program or language used, computing always comes down to execution of machine language on the processor, and the efficiency of a program's execution is going to depend on the appropriate matching of the program's use case with the computation technology applied (if the task is computationally intensive enough for efficiency to matter).
Unless you are using packages that use compiled C programs optimized for particular data wrangling tasks (e.g. dplyr), any computations carried out directly with R operations and program control structures are generally quite inefficient relative to the same algorithm running directly on your processor. However, on a modern computer, you have to be dealing with pretty large data sets before you will be able to tell the difference without a benchmarking timer. While these learning materials are not strongly focused on optimized code, we will occasionally make note of programming practices that generally allow programs to run faster. In general, the more your R code is using C or C++ programs to manipulate native C or C++ data structures (e.g. using R interface packages like dplyr), the faster your R program will run.
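As a rough illustration of the efficiency gap described above, the following sketch compares an explicit R loop to the equivalent vectorized operation, which dispatches the work to compiled C code inside R. The variable names and the size of the vector are arbitrary choices for the demonstration; exact timings will vary by machine.

```r
# Compare an explicit R loop to a vectorized operation.
# Both compute the same result, but the vectorized form runs
# in compiled C code rather than interpreted R control structures.
n <- 1e6
x <- runif(n)

loop_time <- system.time({
  y1 <- numeric(n)
  for (i in seq_len(n)) {
    y1[i] <- x[i] * 2
  }
})

vec_time <- system.time({
  y2 <- x * 2
})

identical(y1, y2)   # TRUE: same result either way
loop_time["elapsed"]
vec_time["elapsed"]  # the loop is typically far slower
```

Running this with a benchmarking timer like `system.time()` is exactly the kind of check the paragraph above alludes to: for small vectors the difference is imperceptible, but it grows with the size of the data.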
Slides from videos
The original slides used in these videos are available below.
Click this link to download the Microsoft PowerPoint file
Note that the Google Slides preview window below provides pictures of the slides that do not include the animations in the original file. Please download the original file from the link above if you would like to view the slides with all animations in Microsoft PowerPoint.
The remainder of this module assumes that you have R and RStudio installed. You will need to have R installed first, so the RStudio installation can find the current R installation and automatically create the necessary links to it. The links to the R installation used by RStudio can easily be changed in RStudio's Global Options under the Tools menu.
Here are the direct links for installing R and RStudio, in the order in which they should be installed:
The Comprehensive R Archive Network (CRAN) for downloading and installing R (among many other R packages).
Open source RStudio Desktop for downloading and installing RStudio (should install R first).
This module also assumes that plentiful public resources are available for helping with the installation of these software packages on your operating system of choice, and does not provide specific narrative or videos on this topic.
Many programming classes are focused on programming skills alone and do not necessarily provide training on how to organize the files associated with running the programs to achieve a task. While the details of a folder structure are highly dependent on the use case driving development of the program, the majority of environmental data processing and analysis tasks might be generalized as a pipeline for data flow through four primary categories of files. Therefore, an abstract folder structure for organizing the files defining a given segment of environmental analysis workflow might be composed of a root folder with the following four subdirectories: 1_input, 2_protocol, 3_incremental, and 4_product. The folder names have a numerical prefix to ensure that the default alphabetical listing typical for file system exploration tools will result in a logical order corresponding to data flow. Simply categorizing workflow files based on their presence in these folders provides inherent metadata toward the ability of others to understand the flow of data through the pipeline.
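The four-folder skeleton described above can also be created programmatically rather than by hand. A minimal sketch in R follows; the root location under `tempdir()` is just a placeholder for wherever you keep your projects.

```r
# Create the root directory of a pipeline and its four numbered
# subdirectories. The numeric prefixes keep default alphabetical
# file listings in an order that matches the flow of data.
root    <- file.path(tempdir(), "my_pipeline")  # example location
folders <- c("1_input", "2_protocol", "3_incremental", "4_product")

for (f in file.path(root, folders)) {
  dir.create(f, recursive = TRUE, showWarnings = FALSE)
}

list.files(root)
# "1_input" "2_protocol" "3_incremental" "4_product"
```

Scripting the folder creation has the side benefit that the structure itself becomes reproducible: anyone running the script gets an identical skeleton.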
Descriptions of the intended content of these folders are provided below in simple HTML format. You might consider routinely including an HTML file similar to this in the root directory of the pipeline to remind you of the purpose of organizing files into these directories to clarify data flow. A modified version of this HTML file may be useful to include in a shared data product to describe the folder structure to a peer attempting to use it, though the bulk of the detailed metadata and description of folder contents should be located in the product directory. These learning materials do not include training on the HTML text-based language for rendering formatted content in web browsers, but this code provides an example of the HTML tags used to designate simple paragraphs and unnumbered lists in the body of an HTML document. Comparing this HTML code with markdown code from previous exercises provides an example of differences in syntax among formatting languages to accomplish similar goals like formatting paragraphs and lists. Understanding that these formatting languages have many common goals despite different syntaxes is a key to learning new formatting languages quickly and translating between them.
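The original HTML file is not reproduced here, but a minimal example of the tags mentioned above (`<p>` for paragraphs and `<ul>`/`<li>` for unnumbered lists) might look like the following. The folder descriptions are paraphrased from this module, not the original file's wording.

```html
<!DOCTYPE html>
<html>
  <body>
    <p>This pipeline is organized into four numbered folders that
    reflect the flow of data from raw inputs to the final product.</p>
    <ul>
      <li>1_input: raw data files as received, never modified</li>
      <li>2_protocol: the code and project files that process the data</li>
      <li>3_incremental: intermediate files generated during processing</li>
      <li>4_product: the final data product and its metadata</li>
    </ul>
  </body>
</html>
```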
Depending on the default configuration of the web browser, this HTML code will render to the following formatted text.
With this strategy in mind, let's work through building the basic structure of a pipeline for our sensitivity analysis project and creating an RStudio project file in the protocol directory that will help us stay organized when working on the R code performing the workflow. Consistent practice of using a project file to open RStudio for work on a given pipeline will eventually facilitate the use of relative paths in code that will enhance the portability and reproducibility of the data product (i.e., the ability to move the entire folder structure to another computer and have it still work). Let's start by creating a root directory for a data product and adding the pipeline folders to the root (4:18 min).
Once the pipeline folders are in place, we can create an RStudio project file in the protocol folder that will make it easier to bring up a unique RStudio session for work on the R code that generates the data product. This is the first step toward working with the data product in a way that will ultimately make it more portable and reproducible (4:55 min).
Before we start programming, this is a good time to review the more commonly used tools in the RStudio integrated development environment in more detail, including thinking about the different parts of your computer's software stack with which you are interacting when using those tools (18:22 min).
Note this video is a little out of date in that the file system on a current Apple computer is likely the APFS file system rather than HFS+.
Slides from videos
The original slides used in these videos are available below.
Click this link to download the Microsoft PowerPoint file
Note that the Google Slides preview window below provides pictures of the slides that do not include the animations in the original file. Please download the original file from the link above if you would like to view the slides with all animations in Microsoft PowerPoint.
Retention of material while learning how to program computers is aided by active learning exercises that have a meaningful goal for what we want the computer to do. For this exercise, we are going to learn the programming skills necessary to create an analytical data product that provides a sensitivity analysis of a commonly used mathematical model. We are going to write a program that performs the calculations of the model with several different parameter values, then creates a visualization that allows us to assess how altering that parameter value influences the behavior of the model. This is an extremely common exercise in scientific analysis when we are trying to determine whether a given mathematical model will be appropriate for describing patterns in our data. Let's review the theoretical origins of the exponential decay model, which is prolifically used across scientific disciplines to describe many different source-limited behaviors in nature. In this case we'll think about it in the context of first-order kinetics from the rate theory of chemistry (6:49 min).
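The derivation covered in the video follows the standard first-order kinetics argument, which can be summarized in symbols (with $C$ the reactant concentration, $C_0$ its initial value at $t = 0$, and $k$ the first-order rate constant):

```latex
% First-order kinetics: the rate of loss is proportional to C itself
\frac{dC}{dt} = -kC
% Separate variables and integrate from C_0 at t = 0 to C(t) at time t
\int_{C_0}^{C(t)} \frac{dC'}{C'} = -k \int_0^{t} dt'
% which gives \ln\!\left(C(t)/C_0\right) = -kt, i.e. the exponential decay model
C(t) = C_0 \, e^{-kt}
```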
To get R to do calculations with this model that will help us understand how it predicts concentration will change over time, we need a way to tell R to create a series of times at which we want to know the concentration. Therefore, the first step in writing R code is to understand how to assign values to variables in the global environment, and how to query or manipulate the variables contained within the global environment (7:53 min).
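A minimal sketch of assignment and environment queries in R follows; the variable names and values are arbitrary examples, not the ones used in the video.

```r
# Assign values to variables in the global environment.
# Even a single number is stored as a one-element atomic vector.
c0 <- 100    # example initial concentration
k  <- 0.5    # example rate constant

# Query the global environment.
ls()             # lists variable names currently defined, e.g. "c0" "k"
exists("c0")     # TRUE

# Remove a variable from the environment.
rm(k)
exists("k")      # FALSE
```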
The most fundamental and irreducible data type in R is the atomic vector. Even when we assign a single value to a variable, R is still creating an atomic vector to hold that value; but that vector is composed of a single element containing the assigned value. Let's briefly review a few of the ways to create vectors with more than one element, with careful attention to the data type being assumed by R in allocating space in RAM for the value (12:04 min).
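A few common ways to create a vector, and the storage types R chooses for them, can be sketched as follows (the variable names are arbitrary):

```r
# Three common ways to create a vector of numbers:
times_c   <- c(0, 1, 2, 3, 4)               # combine explicit values
times_seq <- seq(from = 0, to = 4, by = 1)  # regular sequence
times_col <- 0:4                            # colon shorthand for a sequence

# R chooses a storage type (and RAM layout) for each vector:
typeof(times_c)    # "double"  -- c() with numeric literals stores doubles
typeof(times_col)  # "integer" -- the : operator produces integers
length(times_seq)  # 5
```

The distinction between "double" and "integer" storage rarely matters for small scripts, but it is exactly the kind of data typing detail that the memory discussion earlier in this module was preparing you to understand.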
Now that you are familiar with creating numeric vectors, we can start a script that will execute a sensitivity analysis of exponential decay in the context of first-order kinetics (14:19 min).
Now that we have a vector of times, we need to do some math with it to calculate concentrations. R needs to have special rules for doing mathematical operations with variables representing numeric vectors because each vector in the equation may have a different number of elements. The following video shows examples of how R will recycle the values in shorter vectors to match the number of elements in the longer vector when doing calculations (6:00 min).
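The recycling behavior described above can be sketched in a few lines (the values here are arbitrary examples):

```r
# When vectors of different lengths meet in arithmetic,
# R recycles the shorter vector to match the longer one.
long  <- c(1, 2, 3, 4, 5, 6)
short <- c(10, 20)

long + short
# short recycles as 10 20 10 20 10 20, giving: 11 22 13 24 15 26

# A single value (a one-element vector) recycles across everything:
long * 2
# 2 4 6 8 10 12

# If the longer length is not a multiple of the shorter one,
# R still computes the result but issues a warning:
# c(1, 2, 3) + c(10, 20)  ->  11 22 13, with a warning
```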
Understanding the recycling of values of shorter vectors in R calculations is critical to understanding how the exponential decay equation results in a vector of values the same as the length of the time vector (9:54 min).
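Here is a sketch of how recycling makes the decay equation work over a whole time vector; the parameter values are arbitrary examples, not the ones used in the video.

```r
c0    <- 100                    # initial concentration (example value)
k     <- 0.5                    # rate constant (example value)
times <- seq(0, 10, by = 0.5)   # 21 times at which to evaluate the model

# c0 and k are one-element vectors; R recycles each of them across
# all elements of times, so conc has the same length as times.
conc <- c0 * exp(-k * times)

length(times)   # 21
length(conc)    # 21
conc[1]         # 100, since exp(0) = 1
```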
When a compound data structure you are using has more than one element, the ability to extract specific elements from those data structures for calculations or other data processing is often necessary. R has a wide variety of different ways you can index a vector to create a new vector with only specific elements of the original (11:59 min).
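The main indexing styles can be sketched with a small example vector (the values and names are arbitrary):

```r
x <- c(5, 10, 15, 20, 25)

# Positional indexing: pick elements by position.
x[c(1, 3)]      # 5 15

# Negative indexing: drop elements by position.
x[-1]           # 10 15 20 25

# Logical indexing: keep elements where the condition is TRUE.
x[x > 12]       # 15 20 25

# Name indexing, if the vector has names:
names(x) <- c("a", "b", "c", "d", "e")
x["b"]          # 10 (carrying its name "b")
```

Logical indexing is the style we will lean on most in the filtering step that follows, because the condition can be computed from the data itself.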
Let's work through a simple example of the application of vector indexing by using it to filter the data we consider in the exponential decay sensitivity analysis (5:41 min).
Note that NA values are generally ignored by R's graphing functions. Try plotting the results of the above code to see how the graph is now truncated to the first 99% of reactant consumption.
After the videos above, you should have an R script in the protocol folder of the pipeline that looks something like the following. Note that I have added more thorough comments than those typed in the video.
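The original script is not reproduced here, but a hedged sketch of what such a sensitivity-analysis script might contain, assembled from the steps in this module, follows. The variable names and parameter values are illustrative assumptions, not the original code from the videos.

```r
# Sensitivity analysis of the exponential decay model
#   C(t) = c0 * exp(-k * t)
# varying the first-order rate constant k.

# ---- Parameters (example values) ----
c0    <- 100                    # initial concentration
ks    <- c(0.1, 0.5, 1.0)       # rate constants to compare
times <- seq(0, 20, by = 0.1)   # times at which to evaluate the model

# ---- Model calculations ----
# One concentration vector per value of k; recycling of c0 and k
# makes each result the same length as the times vector.
conc_list <- lapply(ks, function(k) c0 * exp(-k * times))

# ---- Filter to the first 99% of reactant consumption ----
# Replace concentrations below 1% of c0 with NA; graphing
# functions ignore NA values, truncating each curve.
conc_list <- lapply(conc_list, function(conc) {
  conc[conc < 0.01 * c0] <- NA
  conc
})

# ---- Visualization ----
plot(times, conc_list[[1]], type = "l", col = 1,
     xlab = "Time", ylab = "Concentration",
     main = "Sensitivity of exponential decay to k")
for (i in 2:length(ks)) {
  lines(times, conc_list[[i]], col = i)
}
legend("topright", legend = paste("k =", ks),
       col = seq_along(ks), lty = 1)
```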
Slides from videos
Exponential decay derivation
The original slides used in these videos are available below.
Click this link to download the Microsoft PowerPoint file
Note that the Google Slides preview window below provides pictures of the slides that do not include the animations in the original file. Please download the original file from the link above if you would like to view the slides with all animations in Microsoft PowerPoint.
R introductory material
The original slides used in these videos are available below.
Click this link to download the Microsoft PowerPoint file
Note that the Google Slides preview window below provides pictures of the slides that do not include the animations in the original file. Please download the original file from the link above if you would like to view the slides with all animations in Microsoft PowerPoint.
Slides from videos
The original slides used in these videos are available below.
Click this link to download the Microsoft PowerPoint file
Note that the Google Slides preview window below provides pictures of the slides that do not include the animations in the original file. Please download the original file from the link above if you would like to view the slides with all animations in Microsoft PowerPoint.
A general tutorial on R data structures
A more general tutorial on R data structures is available in the general resources for the class.
Link to a full page HTML version of the markdown tutorial in general resources