You have doubtless seen a simple chart of scientific data like this many times, but how much do you know about the process of gathering and preparing that data? When I began my career in chemistry research, I was surprised by the level of critical thinking and scrutiny that goes into the data summed up by even the simplest of figures. If you’re curious, I’ll walk you through how I created this chart, which shows the yield of a chemical reaction over time at different temperatures.
In my work, I am transforming molecules called carboxylic acids into another type of molecule called a thioester. We hope that, once our synthetic method is published, companies will use it to make pharmaceuticals.
This chemical reaction is performed using a reagent designed by a member of my research team. Through earlier experiments, my team has already established that this reagent works. Now that we know it works, we want to find the best conditions under which to run these reactions. Every scientific effort begins with an observation followed by a hypothesis, so, within that broader goal, I needed to settle on a specific research question before I could begin running experiments.
Chemists have to balance multiple priorities when optimizing reactions. Especially if a certain material used in a reaction is expensive, hard to make, or environmentally damaging, it is good to use as little of it as possible; but if the ratio of starting materials is too low, sometimes they won’t react completely. Keeping temperatures low diminishes energy use and prevents unstable molecules from decomposing; but some reactions proceed slowly or not at all unless they receive enough energy from their surroundings. Often, running a reaction for a longer period of time allows the reaction to get closer to completion; but time is valuable, especially in the chemical industry.
In this particular set of experiments, I am interested in monitoring the effect of temperature on the yield of my reaction over time. I know that increasing the temperature increases the rate of the reaction, but I would also like to use as little energy as possible to heat the reaction. Based on industry standards, my specific research question is: what is the lowest temperature (in increments of 10 °C) at which the reaction reaches or nearly reaches completion within four hours?
Typically, in optimization, a specific chemical is chosen as a “model compound.” This compound is used in every reaction during the optimization process so the effect of different reaction conditions on yield can be easily compared. So, all the reactions run in this study used my model compound, and the only thing that changed was the temperature.
In this case, the reaction will not progress unless the temperature is at least 40 °C. So, I decided to run the same reaction with my model compound at 40 °C, 50 °C, and 60 °C, checking the progress of the reaction every hour. With my constant (controlled) variables in place and my one changing variable, temperature, selected, I was ready to begin running experiments.
Data collection begins before a single measurement is taken. I observe the reaction visually while setting it up and note anything that I think could cause an error in my measurements. For example, if a bit of my model compound is stuck to the side of the reaction vessel and I suspect it is not reacting, I write that down, because it could cause the yield calculated for that trial to be artificially low.
This is a photo of my lab notebook from when I was in the middle of running the experiments. As you can see, some entries are crossed out where I later obtained a more accurate measurement, and there are lots of notes written in.
For optimization, yields are calculated by characterizing the reaction mixture using a technique called nuclear magnetic resonance, or NMR. NMR allows me to add a known amount of a reference substance, called an internal standard, to my reaction mixture and determine from the resulting data how much of another substance, such as my product, is present.
I won’t go too in-depth about how this works, but essentially, the technique produces a spectrum that gives me information about the characteristics of the chemicals in the sample. I am looking for two sets of peaks in the spectrum: one that corresponds to my internal standard and one that corresponds to my intended product. By calculating the ratio of the areas under these peaks, I can determine how much of each substance is in the reaction mixture.
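If you are curious what that arithmetic looks like, here is a minimal sketch written in Python. The function name, the proton counts, and every number below are hypothetical stand-ins for illustration, not my actual measurements.

```python
# A simplified sketch of an internal-standard yield calculation.
# All names and numbers here are hypothetical, not real lab data.

def nmr_yield(product_area, standard_area, mol_standard, mol_limiting,
              protons_product=1, protons_standard=1):
    """Estimate percent yield from relative NMR peak areas."""
    # Dividing each peak area by the number of protons it represents
    # lets peaks from different molecules be compared fairly.
    mol_product = (mol_standard
                   * (product_area / protons_product)
                   / (standard_area / protons_standard))
    # Yield = moles of product formed / moles of starting material available.
    return 100 * mol_product / mol_limiting

# Example: equal amounts (0.10 mmol) of internal standard and starting acid;
# the product peak integrates to 0.95 relative to the standard's 1.00.
print(f"{nmr_yield(0.95, 1.00, 0.10, 0.10):.0f}% yield")  # prints "95% yield"
```

The key idea is that, because the amount of internal standard added is known, the ratio of peak areas converts directly into an amount of product.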
The important thing to know here is that I am still looking critically at the information in front of me at this stage. The appearance of the spectrum can give me a lot of information beyond the areas of the peaks I am looking for. I might notice, for example, that all the peaks are low. That would tell me my sample was very dilute and might not be representative of my full reaction mixture.
Above all, the peaks need to be well separated for me to get an accurate measurement of the area under each one. If peaks overlap, I will overestimate how much of a substance is in the reaction mixture, because some of the area under the curve will come from another substance. If this happens, I make a note of it in the spreadsheet where I collect my NMR data. This way, I can be sure the measurements I take are of high quality before I report them.
Once I have made all my measurements, I put them into a spreadsheet to calculate the yields of my reactions. Here is what my spreadsheet looked like while I was in the middle of working with my data and checking for quality. I like to highlight data that has passed all of my quality checks, using different colors to group data together.
Data points that don’t make sense, such as yields that are greater than 100%, are discarded at this stage. I always try to figure out what caused an error like this. If I don’t know why it happened, the same error could be affecting my data in smaller ways without my noticing.
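For illustration, here is roughly what that sanity check amounts to, sketched in Python rather than in my actual spreadsheet; the trial values below are invented.

```python
# A minimal sketch of the "does this value even make sense?" check,
# assuming the data lives in a simple list of records. Values are invented.

trials = [
    {"temp_C": 60, "time_h": 3, "yield_pct": 95.0},
    {"temp_C": 60, "time_h": 4, "yield_pct": 104.2},  # physically impossible
    {"temp_C": 50, "time_h": 4, "yield_pct": 81.3},
]

for trial in trials:
    if trial["yield_pct"] > 100:
        # A yield above 100% signals an error somewhere upstream,
        # so flag it for investigation rather than keeping it quietly.
        print("Discard and investigate:", trial)
    else:
        print("Passed quality check:", trial)
```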
If a data point does appear to be affected by one of the potential sources of error I wrote down, even when the value itself is not technically impossible, I have the option of taking a new sample or rerunning the reaction, depending on where I suspect the error occurred.
To ensure data is reproducible, reactions are often run multiple times. I only ran each reaction once at this stage, but it is not uncommon to run every reaction two or three times and use the average values. In other branches of science and other contexts, the number of trials for each data point can be even higher. Data with many trials can be analyzed further to confirm its quality. For example, any dataset with at least four points can be tested for outliers, and any points flagged as outliers can be discarded.
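As an example of what such an outlier check can look like, here is a hedged sketch of Dixon's Q test, a test commonly used in analytical chemistry for small sets of replicates. It is not necessarily the exact procedure my group would use, and the replicate yields below are invented.

```python
# Dixon's Q test for a single suspected outlier in a small replicate set.
# Critical values shown are the standard 95%-confidence values for n = 3-7.

Q_CRIT_95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625, 7: 0.568}

def has_outlier(values):
    """Return True if the most extreme value fails Dixon's Q test."""
    data = sorted(values)
    # Gap between the suspect point (lowest or highest) and its nearest neighbor.
    gap = max(data[1] - data[0], data[-1] - data[-2])
    spread = data[-1] - data[0]  # full range of the data
    return (gap / spread) > Q_CRIT_95[len(data)]

replicate_yields = [92.0, 94.0, 93.0, 71.0]  # four trials of the same reaction
print(has_outlier(replicate_yields))         # True: the 71% point is suspect
```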
All these things are simply additional checks to make sure that data is of high quality.
This, again, is the final figure displaying the data I collected for the optimization of my model reaction at the three temperatures I chose. This figure is intended for a progress report given to other members of my research group, so we can all quickly observe the trend in reaction rate as temperature increases.
I have answered my research question! There is a temperature at which I can run my reaction approximately to completion within four hours: at 60 °C, the yield reaches 98% at some point between three and four hours. At any lower temperature, the reaction proceeds too slowly for my purposes, so 60 °C is my optimal temperature, and I do not need to try any higher temperatures.
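For completeness, here is the decision rule I applied, sketched in Python. Only the 98% yield at 60 °C after four hours comes from the figure described above; the other yields and the 95% "near completion" threshold are placeholders for illustration.

```python
# Placeholder time-course data: temperature (°C) -> yields (%) at hours 1-4.
# Only the final 98% value at 60 °C reflects the figure; the rest are invented.
time_course = {
    40: [10, 20, 28, 35],
    50: [30, 52, 68, 78],
    60: [60, 85, 96, 98],
}

TARGET = 95  # assumed threshold for "reaches or nearly reaches completion"

# The lowest temperature whose four-hour yield meets the target answers the question.
optimal = min(t for t, yields in time_course.items() if yields[-1] >= TARGET)
print(f"Optimal temperature: {optimal} °C")  # prints "Optimal temperature: 60 °C"
```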
I hope this helps you appreciate that the attention to detail behind the collection of even straightforward data can quickly become immense. High-quality science builds trust within the scientific community as well as with the public, and we can only know science is of good quality by scrutinizing it. When you look at research, ask yourself: is this quality data? How do I know? Thinking critically about data for yourself can teach you a lot about the world and how we understand it.