Instrument Deployment Primary Data
This module is under active development
The rapidly evolving concept of generating reusable and curatable "data products" reaches far beyond the scientific endeavor. Most data-driven enterprises are recognizing that at least some types of data sets have intrinsic value outside of the immediate plans for analysis. Furthermore, substantial additional value can be added to a data product when its designer provides additional context or features that anticipate the potential alternative "use cases" for the data beyond the immediate plans for analysis. While the video below is clearly in the context of business applications rather than scientific applications, I think you will recognize some of the common abstractions regarding the design principles that promote reusable and curatable data products (4:42 min).
Once we have at least an abstract grasp on the desirable properties of a data product, we can start thinking of more specific examples of generating useful data products along the typical path of scientific workflow. As the video above highlights, the concept of a data product is highly abstract and the nature of the content of any given data product is going to vary strongly with the context of the use cases for the data.
However, environmental science that depends on observational data does introduce a nearly ubiquitous workflow decision point that is important across the majority of field-based research. That key decision point occurs after we have collected a data set from nature but before we start to look for meaningful patterns in those data. The intrinsic value of a data product beyond the immediate plans for analysis depends strongly on the ability to pivot at that decision point toward alternative analyses. Therefore, we should be designing "primary" data products that objectively capture the data at its entry point into the scientific process, before our perspective on that data may be altered by subjective decisions regarding data analysis. The design of these primary data products should attempt to anticipate use cases for the data beyond our immediate plans for analysis, to give our peers (or our future selves) the information needed to have confidence in reusing the data for multiple purposes.
The goal of this module is to introduce a relatively simple case study of constructing a primary data product that includes time series data generated by deployment of a photosynthetically active radiation (PAR) sensor and data logger. Despite the simplicity of the data set, this exercise will reveal the additional complexity that comes with including the metadata necessary for data product reuse. Automating the management of the data and metadata together will be addressed by introducing the principles and practices of literate programming, where the automated generation of a narrative report describing the workflow for generating the data product is tightly integrated with the automated processing and analysis of the data.
Turing laureate Donald Knuth wrote the seminal work on the principles of literate programming (https://doi.org/10.1093/comjnl/27.2.97). Here is a video of him speaking on the origins of the idea and its tangible influence on software maintenance (5:25 min).
We will be applying literate programming principles both to how we comment our code, as well as how we integrate code into the automated generation of data processing and metadata reports.
One of the goals of generating reproducible data products is to make sure that other researchers (or our future selves) have a full record of the decisions we made in processing and analyzing the data. Before we get into a detailed example of how you might generate a "reproducible" data product, let's provide some detail on what we mean when we say that a data product should be "reproducible" and how the ethic of generating reproducible data is treated by the scientific community (7:20 min).
For our first example of a reproducible data product, we will be focusing on "primary" data at the entry point to the scientific process. Let's review how "primary" data products relate to the planning for the generation of both primary and derived data products at the key decision points through the entire scientific workflow. We can also revisit the topic of data pipelines in more detail and the benefits of using literate programming to maintain metadata (12:45 min).
Slides from videos
The original slides used in these videos are available below.
Click this link to download the Microsoft PowerPoint file
Note that the Google Slides preview window below provides pictures of the slides that do not include the animations in the original file. Please download the original file from the link above if you would like to view the slides with all animations in Microsoft PowerPoint.
The sample data for this case study is from a LICOR Quantum Photosynthetically Active Radiation (PAR) sensor. The sensor was wired to a Campbell CR300 data logger (1:50 min).
You can start to prepare a pipeline for a primary data product long before you actually go into the field. The following video reviews the parts of the pipeline you might start to build while you are planning for a deployment (14:59 min).
Note that you might also want to have some prepared code in the pipeline before you go in the field, so you can quickly explore the data immediately after downloading the data logger. For the purposes of this learning exercise, we will cover how to write that code later.
After we have executed our field deployment and downloaded data files from the instrument, we are ready to write some code to start the steps for processing the data. First let's get our data product ready for some R programming by creating an RStudio project file that will provide a consistent working directory. This time, let's give an example of how to use a ".Rprofile" file to set the working directory for our RStudio project to the root of the pipeline rather than the protocol folder (7:02 min).
Note that setting the working directory to the protocol folder or the root of the pipeline is a personal preference. I prefer the latter because it seems more logical when I need to repeatedly access all four folders of the pipeline. However, as long as you are using relative paths from a consistent working directory, the data product will still be portable and either approach will provide that portability. That said, you will need to take the approach in this video if you want the sample code provided later in this exercise to work with your project without changing all the relative paths.
After the videos above, you should have a .Rprofile file in the protocol folder of the pipeline that looks something like the following. Note that I have added more thorough comments than those typed in the video.
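A minimal sketch of such a file follows, assuming the .Rprofile is saved in a folder named protocol that sits directly under the pipeline root. The folder name par_pipeline and the use of a temporary directory are hypothetical and exist only so the sketch can run anywhere. The two lines stored in rprofile_lines are the part that would live in the actual file.

```r
# Build a throwaway pipeline skeleton (hypothetical folder names)
pipeline_root <- file.path(tempdir(), "par_pipeline")
dir.create(
  file.path(pipeline_root, "protocol"),
  recursive = TRUE,
  showWarnings = FALSE
)

# Contents of the hypothetical .Rprofile stored in the protocol folder
rprofile_lines <- c(
  "# Move the working directory up one level to the pipeline root",
  "setwd(\"..\")"
)
writeLines(rprofile_lines, file.path(pipeline_root, "protocol", ".Rprofile"))

# Imitate opening the project: start in the protocol folder,
# then run the .Rprofile stored there
old_wd <- setwd(file.path(pipeline_root, "protocol"))
source(".Rprofile")
new_wd <- getwd()

# Restore the original working directory of this session
setwd(old_wd)

# The working directory ended up at the pipeline root
basename(new_wd)
```

With the working directory at the root, relative paths in later code start from the root of the pipeline rather than from the protocol folder.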
Now that we are processing data for a primary data product, we should be thinking about how to tightly couple our management of metadata with our management of data. Literate programming tools like Rmarkdown are useful for embedding the code that performs the data processing directly into the metadata narrative describing the workflow. Let's introduce the basics of creating and rendering an Rmarkdown file, including calling the render function from the rmarkdown package ourselves to have more control over its behavior. Note that the rmarkdown package is not included in the base R distribution and you may have to install the package from CRAN (install.packages("rmarkdown")) to follow along with the next video (15:48 min).
With the basics of rendering Rmarkdown working, we can start to experiment with embedding R code into the markdown to automate content. Let's automate adding the time the report was rendered to the end of the report using R code embedded in-line with the markdown code (10:20 min).
Embedding R code in-line with the markdown code is generally only suitable when a single line of code can provide the content to be put into the report. More complex automation of data product generation requires more lines of code, which requires being able to embed multiple lines of R code as a "chunk". Code chunks are also more configurable and thus provide more flexibility in specifying how the code or its output is included in the rendered report (13:57 min).
After the videos above, you should have an Rmarkdown file in the protocol folder of the pipeline that looks something like the following. Note that I have added more thorough comments than those typed in the video.
You should also have an R script for rendering the Rmarkdown file that looks something like the following. Note that I have added more thorough comments than those typed in the video.
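As a rough illustration of how the two files work together, the sketch below writes a tiny Rmarkdown file containing one code chunk and one in-line expression, then calls rmarkdown::render() on it. The file name example_report.Rmd, the chunk label, and the report text are all invented for illustration. Rendering is attempted only if the rmarkdown package and pandoc are available.

```r
# Three backticks, assembled here so the chunk fence can be written as text
fence <- strrep("`", 3)

# Lines of a minimal, hypothetical Rmarkdown file
rmd_lines <- c(
  "---",
  "title: \"Example processing report\"",
  "output: html_document",
  "---",
  "",
  "## A code chunk",
  "",
  paste0(fence, "{r example-chunk, echo=TRUE}"),
  "x <- c(1, 2, 3)",
  "mean(x)",
  fence,
  "",
  "## In-line code",
  "",
  "Report rendered at `r format(Sys.time(), usetz = TRUE)`."
)

# Save the Rmarkdown file to a temporary location
rmd_path <- file.path(tempdir(), "example_report.Rmd")
writeLines(rmd_lines, rmd_path)

# Render only when the required tools are installed
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()
if (can_render) {
  rmarkdown::render(input = rmd_path, output_dir = tempdir(), quiet = TRUE)
}
```

Calling render() from a script, rather than relying on a button in the editor, keeps choices such as the output directory recorded in code.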
The backbone of most environmental science data products is the table. Working with tables in a computer programming language requires thinking about multi-element data structures with more than one dimension of indices for elements. Let's read data from our PAR sensor deployment using the read.table() function to start getting a feel for the nature of two-dimensional data structures (17:42 min).
Two-dimensional data structures in R are built from various combinations of one-dimensional data structures. This hierarchical structure in the definition of data types means the more complex data structures inherit properties from the simpler data structures that compose them. Therefore, understanding the basics of this inheritance of class definitions in R object-oriented implementations helps you translate your understanding of the simpler one-dimensional data structures like atomic vectors and lists directly to your understanding of more complex two-dimensional data structures like matrices and data frames (3:22 min).
The R data frame (or classes derived from it) is a two-dimensional data structure that is commonly used for tabular scientific data. Therefore, data frame objects will be used more commonly than matrices in this material. However, the R matrix is a simpler data structure that introduces the basic concepts of two-dimensional data. Furthermore, the elements of a data frame can be indexed numerically using the same syntax as the two-dimensional indexing of the elements of a matrix. The following is a brief introduction of how matrix objects are a layer of complexity built upon atomic vector classes in R, which provides an introduction to the fundamentals of class inheritance and two-dimensional data structures. You will ultimately need to know more about how to use matrices if your calculations require multi-dimensional matrix algebra (15:06 min).
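The relationship between an atomic vector and a matrix can be seen in a few lines of base R. This is a small sketch with made-up values, not the example used in the video.

```r
# An atomic vector of six integers
v <- 1:6

# A matrix is the same kind of vector with a "dim" attribute added
m <- matrix(v, nrow = 2, ncol = 3)
attributes(m)   # the dim attribute holds the number of rows and columns
typeof(m)       # "integer", inherited from the underlying vector

# Elements are filled column by column, so the element in
# row 2, column 3 is the sixth element of the vector
m[2, 3]         # 6
m[6]            # 6, a single index still works as it does for a vector

# Assigning a dim attribute to a vector produces the same matrix
v2 <- 1:6
dim(v2) <- c(2, 3)
identical(m, v2)  # TRUE
```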
Data frame objects cannot be constructed from a single atomic vector like matrix objects because each column has the potential to be a different atomic data type. Therefore, data frames are constructed as a list of multiple atomic vectors, where all the vectors composing the elements of the list are the same length representing the number of rows in the table. Each of the vectors composing an element of the list represents one of the columns in the table (10:12 min).
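A short sketch makes the list structure visible. The column names and values below are invented for illustration.

```r
# A data frame built from two atomic vectors of the same length
par_table <- data.frame(
  timestamp = c("2020-06-01 12:00:00", "2020-06-01 12:15:00",
                "2020-06-01 12:30:00"),
  par_value = c(1500.2, 1480.7, 1465.1),
  stringsAsFactors = FALSE
)

# Underneath the data.frame class, the object is a list
typeof(par_table)    # "list"
is.list(par_table)   # TRUE

# The list has one element per column
length(par_table)    # 2

# Each element is an atomic vector with one value per row
nrow(par_table)             # 3
length(par_table$par_value) # 3

# Columns can have different atomic data types
sapply(par_table, class)
```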
A more general tutorial on R data structures including matrices and data frames is available in the general resources for the class.
Link to a full page HTML version of the tutorial in general resources
The ability to index a data frame like a matrix and the properties of the data frame class that are inherited from the underlying list class together allow for efficient ways to index data frames that are commonly useful for filtering or wrangling data in scientific data analysis workflow (13:04 min).
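The sketch below shows matrix-style and list-style indexing side by side on a small invented table, ending with a logical filter of the rows.

```r
# Small invented table for demonstration
par_table <- data.frame(
  timestamp = c("2020-06-01 04:00:00", "2020-06-01 12:00:00",
                "2020-06-01 12:15:00"),
  par_value = c(0.4, 1500.2, 1480.7),
  stringsAsFactors = FALSE
)

# Matrix-style indexing: [row, column]
par_table[2, 2]              # one element
par_table[, "par_value"]     # one column as an atomic vector
par_table[1, ]               # one row as a data frame

# List-style indexing inherited from the list class
par_table$par_value          # same column as above
par_table[["timestamp"]]     # a column selected by name

# A logical vector in the row index filters the rows
is_daylight <- par_table$par_value > 100
daylight_rows <- par_table[is_daylight, ]
nrow(daylight_rows)          # 2
```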
A central skill toward being a better programmer and debugger is the ability to understand the specifics of how data structures are changing with each line of a given program. Here is a review of a few more of the R functions that are useful for exploring the structure of a given object and thus give the ability to directly observe how data structures are changing through a program (4:00 min).
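A few base R functions commonly used for this kind of inspection are sketched below. Which functions the video covers is not stated here, so treat this selection as an assumption.

```r
# Small invented table for demonstration
par_table <- data.frame(
  timestamp = c("2020-06-01 12:00:00", "2020-06-01 12:15:00",
                "2020-06-01 12:30:00"),
  par_value = c(1500.2, 1480.7, 1465.1),
  stringsAsFactors = FALSE
)

class(par_table)       # "data.frame"
typeof(par_table)      # "list", the underlying data structure
dim(par_table)         # number of rows, then number of columns
names(par_table)       # column names
attributes(par_table)  # all attributes attached to the object
str(par_table)         # compact summary of the whole structure
head(par_table, n = 2) # first rows of the table
```

Running these after each step of a program shows exactly when a column changes type or a table changes size.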
Now that we have a more detailed understanding of data frames, let's look at the structure of the data frame with PAR data we read from our downloaded data logger files (5:20 min).
After the videos above, you should have a code chunk for reading a data file that looks something like the following. Note that I have added more thorough comments than those typed in the video.
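A minimal sketch of reading such a file follows. The file written here is a small made-up stand-in for a data logger download; its layout, column names, and values are assumptions and not the actual deployment data.

```r
# Write a small made-up file to a temporary location
data_file <- file.path(tempdir(), "example_logger_file.dat")
writeLines(
  c(
    "\"made-up file description line\"",
    "\"TIMESTAMP\",\"RECORD\",\"PAR\"",
    "\"2020-06-01 12:00:00\",1,1500.2",
    "\"2020-06-01 12:15:00\",2,1480.7",
    "\"2020-06-01 12:30:00\",3,1465.1"
  ),
  data_file
)

# Read only the data rows, skipping the two lines above them
par_data <- read.table(
  file = data_file,
  sep = ",",
  skip = 2,
  header = FALSE,
  stringsAsFactors = FALSE
)

# The result is a two-dimensional structure: rows by columns
dim(par_data)
par_data[1, ]   # first row
par_data[, 3]   # third column
```

Because no headers were read, R assigns the default column names V1, V2, and V3.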
Slides from videos
The original slides used in these videos are available below.
Click this link to download the Microsoft PowerPoint file
Note that the Google Slides preview window below provides pictures of the slides that do not include the animations in the original file. Please download the original file from the link above if you would like to view the slides with all animations in Microsoft PowerPoint.
Let's start by reading the headers of the columns from the data file to allow for more informative names for the columns in the data frame (4:39 min).
Note that the header = TRUE argument for the read.table() function can import the headers without the extra step if the header row is immediately above the data in the text file being read.
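When another line sits between the header row and the data, one option is to read the header row on its own and then apply it as column names. The sketch below assumes a made-up file with a description line, a header line, and a units line above the data; the layout and names are invented for illustration.

```r
# Write a small made-up file to a temporary location
data_file <- file.path(tempdir(), "example_logger_headers.dat")
writeLines(
  c(
    "\"made-up file description line\"",
    "\"TIMESTAMP\",\"RECORD\",\"PAR\"",
    "\"made-up units\",\"made-up units\",\"made-up units\"",
    "\"2020-06-01 12:00:00\",1,1500.2",
    "\"2020-06-01 12:15:00\",2,1480.7"
  ),
  data_file
)

# Read only the header row (second line) as character data
header_row <- read.table(
  file = data_file,
  sep = ",",
  skip = 1,
  nrows = 1,
  header = FALSE,
  colClasses = "character"
)
column_names <- unlist(header_row[1, ], use.names = FALSE)

# Read the data rows and apply the names read from the file
par_data <- read.table(
  file = data_file,
  sep = ",",
  skip = 3,
  header = FALSE,
  col.names = column_names,
  stringsAsFactors = FALSE
)
names(par_data)
```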
Now that we have the code for reading the data working, let's think about what we want to show up in the report rendered from the Rmarkdown. We can adjust the chunk settings and add some reporting on the nature of the data frame that was read from the text file (11:27 min).
The file downloaded from the data logger included data from a deployment unrelated to this data product. Those data need to be removed from the data frame. For the sake of a fully reproducible data product, let's be sure the report explicitly includes information about which data were removed and why (13:52 min).
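One way to sketch this step is to flag the unrelated records with a logical vector, keep a copy of what was removed, and generate the text for the report from those objects. The table, the deployment start time, and the column names below are invented for illustration.

```r
# Invented table containing records from two deployments
par_data <- data.frame(
  TIMESTAMP = c("2020-05-20 10:00:00", "2020-05-20 10:15:00",
                "2020-06-01 12:00:00", "2020-06-01 12:15:00"),
  PAR = c(12.1, 11.8, 1500.2, 1480.7),
  stringsAsFactors = FALSE
)

# Hypothetical start of the deployment of interest
deployment_start <- as.POSIXct("2020-06-01 00:00:00", tz = "UTC")

# Flag records logged before the deployment started
record_time <- as.POSIXct(
  par_data$TIMESTAMP,
  format = "%Y-%m-%d %H:%M:%S",
  tz = "UTC"
)
is_unrelated <- record_time < deployment_start

# Keep a record of what is removed, then remove it
removed <- par_data[is_unrelated, ]
par_data <- par_data[!is_unrelated, ]

# Text suitable for printing in the rendered report
removal_note <- sprintf(
  "Removed %d records logged before %s (unrelated deployment).",
  nrow(removed),
  format(deployment_start, usetz = TRUE)
)
cat(removal_note, "\n")
```

Generating the note from the same objects used for filtering keeps the report consistent with what the code actually did.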
We can now read the second data logger file from the deployment and combine it with the data table read from the first file (9:14 min).
After the videos above, you should have a code chunk for reading a data file that looks something like the following. Note that I have added more thorough comments than those typed in the video.
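Combining tables read from two files can be sketched with rbind(), assuming both files share the same columns in the same order. The file names and contents here are made up.

```r
# Helper to write a small made-up data file (hypothetical layout)
write_example_file <- function(path, data_lines) {
  writeLines(c("\"TIMESTAMP\",\"PAR\"", data_lines), path)
}

file_1 <- file.path(tempdir(), "example_download_1.dat")
file_2 <- file.path(tempdir(), "example_download_2.dat")
write_example_file(
  file_1,
  c("\"2020-06-01 12:00:00\",1500.2", "\"2020-06-01 12:15:00\",1480.7")
)
write_example_file(
  file_2,
  c("\"2020-06-01 12:30:00\",1465.1", "\"2020-06-01 12:45:00\",1450.9")
)

# Read each file with identical settings
table_1 <- read.table(file_1, sep = ",", header = TRUE,
                      stringsAsFactors = FALSE)
table_2 <- read.table(file_2, sep = ",", header = TRUE,
                      stringsAsFactors = FALSE)

# Stack the rows of the second table below the first
par_data <- rbind(table_1, table_2)
nrow(par_data)   # 4
```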
Before we put more time into building a data product, we should probably make sure that the data are worth saving. This brings us to the topic of performing and reporting the quality assurance of the data. First, we should make sure we understand what R is doing when it reads in the data from a text file into a data frame and whether anything we may not be expecting is happening even though the code is running without errors (12:36 min).
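One example of code that runs without errors but does something unexpected is a numeric column that contains a text marker for missing values. The sketch below assumes the marker is the text NAN, which is an assumption about the file and not something confirmed here. The data are invented.

```r
# Write a small made-up file with one missing measurement
data_file <- file.path(tempdir(), "example_missing.dat")
writeLines(
  c(
    "\"2020-06-01 12:00:00\",1500.2",
    "\"2020-06-01 12:15:00\",\"NAN\"",
    "\"2020-06-01 12:30:00\",1465.1"
  ),
  data_file
)

# Read with default settings: no error, but the column is not numeric
raw_read <- read.table(data_file, sep = ",", stringsAsFactors = FALSE)
class(raw_read$V2)     # "character"

# Tell R which text marks a missing value
clean_read <- read.table(
  data_file,
  sep = ",",
  na.strings = "NAN",
  stringsAsFactors = FALSE
)
class(clean_read$V2)   # "numeric"
clean_read$V2          # the missing measurement is now NA
```

Checking the class of each column right after reading catches this kind of silent type change early.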
Now that we know that PAR data are not available for some of the records in our data frame, perhaps quality assurance for the data product will benefit from some Rmarkdown to report on exactly which records are missing data (12:28 min).
After the videos above, you should have altered the code chunk for reading the data files to look something like the following. Note that I have added more thorough comments than those typed in the video.
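Locating the records with missing data can be sketched with is.na() and which(). The table below is invented for illustration.

```r
# Invented table with two missing PAR values
par_data <- data.frame(
  TIMESTAMP = c("2020-06-01 12:00:00", "2020-06-01 12:15:00",
                "2020-06-01 12:30:00", "2020-06-01 12:45:00"),
  PAR = c(1500.2, NA, 1465.1, NA),
  stringsAsFactors = FALSE
)

# Row numbers of records with missing PAR values
missing_rows <- which(is.na(par_data$PAR))
missing_rows

# Timestamps of those records, suitable for listing in the report
missing_times <- par_data$TIMESTAMP[missing_rows]

# Summary sentence that could be printed from a code chunk
cat(
  length(missing_rows), "of", nrow(par_data),
  "records are missing PAR data:\n",
  paste(missing_times, collapse = "\n "), "\n"
)
```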
As demonstrated above, we can pretty easily get R to automatically identify problems like NA values or possibly values that are outside ranges that are known to be realistic. However, when it comes to a quick assessment of whether the data have more nuanced problems than can be identified with simple conditional logic, a human reviewing a time series graph of the data is hard to beat. But before we can start generating a time series graph of the data for quality assurance review, we need to convert the timestamp column from a character data type to a date-time data type with an underlying numeric representation, which will allow R's graphing tools to produce a coherent figure (15:37 min).
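The conversion can be sketched with as.POSIXct(). The timestamp format and the UTC time zone used below are assumptions about how the text was written by the logger.

```r
# Timestamps as they might appear after reading a text file
timestamp_text <- c("2020-06-01 12:00:00", "2020-06-01 12:15:00")
class(timestamp_text)   # "character"

# Convert the text to date-time values, stating the format and time zone
time_utc <- as.POSIXct(
  timestamp_text,
  format = "%Y-%m-%d %H:%M:%S",
  tz = "UTC"
)
class(time_utc)         # "POSIXct" "POSIXt"

# Underneath, the values are seconds since 1970-01-01 00:00:00 UTC
as.numeric(time_utc)

# Numeric time makes arithmetic possible: 15 minutes is 900 seconds
diff(as.numeric(time_utc))   # 900
```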
Dealing with time as data is one of the programming skills that I have seen challenge students most persistently. Because R uses the POSIX standard, learning the details of how base R functions deal with time and time zones is fairly transferable to other languages and computing platforms. We'll get more practice with time as data as we progress through generating the visualizations below, but I highly recommend that you review the following tutorial to have a solid foundation on the topic before progressing to the next topic. I would suggest that dealing with time as data is one of those skills where a little extra effort spent solidifying your understanding and best practices now is likely to save you a great deal of stress and debugging time in the future.
A more general tutorial on dealing with time in R is available in the general resources for the class.
Link to a full page HTML version of the tutorial in general resources
Graphing several months of PAR data one week at a time will take some algorithmic thinking and potentially a loop. However, trying to code all of that at once is likely to lead to code that doesn't work due to bugs that are hard to find. Let's take the approach of getting the simplest graph we can imagine working properly, and then we can add complexity one step at a time, testing as we go. This practice incorporates testing with coding and makes it more obvious which code is causing the problem when we inevitably create a bug (14:06 min).
For the sake of quality assurance of PAR data, perhaps we don't want to be thinking about when the sun should be rising and setting in UTC vs. the local time zone where the data were collected. Let's change our perspective on time in the graph by changing the time zone attribute of our POSIX time vector (6:55 min).
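Changing the time zone attribute changes how a time is displayed without changing the instant it represents. In the sketch below, the time zone name America/Denver is a hypothetical stand-in for the local time zone of a deployment.

```r
# One instant in time, defined in UTC
time_utc <- as.POSIXct("2020-06-01 18:00:00", tz = "UTC")

# Copy the vector and change only its time zone attribute
time_local <- time_utc
attr(time_local, "tzone") <- "America/Denver"

# The displayed clock time differs between the two
format(time_utc, "%H:%M")     # "18:00"
format(time_local, "%H:%M")   # "12:00", six hours behind UTC on this date

# The underlying numeric value is unchanged
as.numeric(time_utc) == as.numeric(time_local)   # TRUE
```

Because only the attribute changes, graphs drawn from time_local show local clock time on the axis while plotting exactly the same measurements.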
Ultimately, we are going to want to customize this graph, so we should get away from using the defaults of the plot() function. Also, we can start cleaning up some details of the margins and the axes (14:38 min).
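A rough sketch of taking control of the margins and axes is shown below. The data are synthetic (a clipped sine wave standing in for a daily light cycle) and the output file name is hypothetical. The plot is written to a PDF file so the sketch runs without an interactive graphics window.

```r
# One week of synthetic 15 minute data
time_utc <- seq(
  from = as.POSIXct("2020-06-01 00:00:00", tz = "UTC"),
  by = "15 min",
  length.out = 7 * 96
)
hours <- as.numeric(format(time_utc, "%H")) +
  as.numeric(format(time_utc, "%M")) / 60
par_values <- pmax(0, 1800 * sin(pi * (hours - 6) / 12))

# Open a PDF graphics device in a temporary location
plot_file <- file.path(tempdir(), "example_par_week.pdf")
pdf(plot_file, width = 8, height = 4)

# Set margins (bottom, left, top, right) in lines of text
par(mar = c(4.5, 4.5, 1, 1))

# Draw the data without the default axes
plot(
  x = time_utc,
  y = par_values,
  type = "l",
  axes = FALSE,
  xlab = "Time (UTC)",
  ylab = "PAR"
)

# Add a time axis with one labeled tick per day, then the y axis
day_ticks <- seq(min(time_utc), max(time_utc), by = "day")
axis.POSIXct(side = 1, x = time_utc, at = day_ticks, format = "%b %d")
axis(side = 2)
box()

# Close the device to finish writing the file
invisible(dev.off())
```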
Now that we can plot one week of data, why not plot all of them? Let's add a loop to plot all of the data in the data product (14:40 min).
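The loop can be sketched by building a vector of week start times and subsetting the data inside each iteration. The data below are synthetic and the output file name is hypothetical. Each week is drawn on its own page of a PDF file.

```r
# Three weeks of synthetic hourly data
time_utc <- seq(
  from = as.POSIXct("2020-06-01 00:00:00", tz = "UTC"),
  by = "hour",
  length.out = 21 * 24
)
hours <- as.numeric(format(time_utc, "%H"))
par_values <- pmax(0, 1800 * sin(pi * (hours - 6) / 12))

# Start time of each week covered by the data
week_starts <- seq(min(time_utc), max(time_utc), by = "week")
seconds_per_week <- 7 * 24 * 3600

plot_file <- file.path(tempdir(), "example_par_all_weeks.pdf")
pdf(plot_file, width = 8, height = 4)

# Index the vector by position so each start time keeps its date-time class
for (i in seq_along(week_starts)) {
  week_start <- week_starts[i]
  week_end <- week_start + seconds_per_week

  # Logical vector selecting the records in the current week
  in_week <- time_utc >= week_start & time_utc < week_end

  plot(
    x = time_utc[in_week],
    y = par_values[in_week],
    type = "l",
    xlab = "Time (UTC)",
    ylab = "PAR",
    main = paste("Week starting", format(week_start, "%Y-%m-%d"))
  )
}

invisible(dev.off())
```

Getting a single week working first, and only then wrapping it in the loop, keeps each new source of bugs small.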
Finally, we actually have a rough idea of what these measurements should be relative to the amount of radiation we know is coming from our local star. The last step of our quality assurance may be to add plots of estimates of solar radiation incident on Earth's atmosphere to see if our measurements make sense (16:02 min).
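One common textbook approximation for solar irradiance on a horizontal surface at the top of the atmosphere is sketched below. It is not necessarily the method used in the video. The function name, the solar constant of 1361 W m^-2, the simple declination and Earth-Sun distance formulas, and the use of local solar time are all assumptions of this sketch. The result is in energy units, so it is suited to comparing the timing and shape of the daily cycle rather than matching PAR values directly.

```r
# Approximate top-of-atmosphere irradiance on a horizontal surface (W m^-2)
# day_of_year: 1 to 365; solar_hour: local solar time in hours (12 = noon)
toa_irradiance <- function(day_of_year, solar_hour, latitude_deg,
                           solar_constant = 1361) {
  latitude <- latitude_deg * pi / 180

  # Approximate solar declination in radians
  declination <- (23.45 * pi / 180) *
    sin(2 * pi * (284 + day_of_year) / 365)

  # Hour angle: 15 degrees per hour away from solar noon
  hour_angle <- (solar_hour - 12) * 15 * pi / 180

  # Cosine of the solar zenith angle
  cos_zenith <- sin(latitude) * sin(declination) +
    cos(latitude) * cos(declination) * cos(hour_angle)

  # Correction for the varying Earth-Sun distance
  distance_factor <- 1 + 0.033 * cos(2 * pi * day_of_year / 365)

  # Irradiance is zero when the sun is below the horizon
  pmax(0, solar_constant * distance_factor * cos_zenith)
}

# Example: hourly estimates for day 172 at a hypothetical latitude of 40 N
hourly_estimate <- toa_irradiance(
  day_of_year = 172,
  solar_hour = 0:23,
  latitude_deg = 40
)
round(hourly_estimate)
```

Measurements at the surface should fall below an estimate like this, because the atmosphere absorbs and scatters part of the incoming radiation.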
After the videos above, you should have a code chunk for visualizing the data that looks something like the following. Note that I have added more thorough comments than those typed in the video.
If you have studied the content and worked through the exercises in this module, you hopefully now have some of the following abilities:
Be able to modularize scientific workflow by identifying the key decision points where data products should be generated.
Be able to organize data products with a portable pipeline folder structure that clearly differentiates source data from protocol and derived data.
Be able to start the population of a pipeline for creation of an instrument deployment primary data product at the initiation of planning for the deployment.
Be able to use R code embedded as in-line text or chunks to automatically generate components of human readable reports using Rmarkdown.
Be able to describe the structure and use of matrices and data frames, the two-dimensional data types most commonly used in R.
Be able to use R to read, filter, and integrate data from text files downloaded from instruments and be able to use Rmarkdown to automate the generation of the report on the processing of the data.
Be able to use R to identify any records where data in a time series are unavailable and be able to use Rmarkdown to report the location of those records.
Be able to create a visualization of the time series data to assist in a quality assurance assessment and to provide a summary visualization component for users of the data product.