Results:
Summary:
My project objective, to create an easily accessible program that quantifies statistical errors and can be used by others to calibrate further experiments, was met. The program developed is discussed below.
Specifically, the program takes in Molecular Dynamics (MD) data, constructs coarse grain samples from the data, and runs a series of four statistical tests to determine whether the given parameters of the data are in an equilibrated state. It outputs the results of these tests along with a graphical visualization of both the premodified and postmodified data, to give a clearer picture of what is happening. The inputs for the program are: start time, time step, and segments.
Coarse grain sample and visualization:
First, the program creates a coarse grain sample from the existing data by averaging several consecutive data points together (the number averaged is denoted by m) to create n "segments" of data. This effectively draws a statistical sample from the large population of data, which is then used to run the statistical tests (example shown on the left).
Graphical output from the program showing the creation of the coarse grain sample for m=30.
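As a reference for how this block averaging can be expressed, below is a minimal sketch in Python; the function name coarse_grain, the use of NumPy, and the choice to drop leftover points that do not fill a complete block are illustrative assumptions rather than the program's actual source.

```python
import numpy as np

def coarse_grain(data, m):
    """Average consecutive blocks of m raw points into n coarse grain segments."""
    data = np.asarray(data, dtype=float)
    n = len(data) // m          # number of complete segments
    trimmed = data[:n * m]      # drop leftover points that do not fill a block
    return trimmed.reshape(n, m).mean(axis=1)

# e.g. coarse_grain(raw_values, m=30) reproduces the m = 30 case shown in the figure above
```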
Overview of statistical tests within program:
Following that, four statistical tests are performed on the new coarse grain sampled data.
The four tests that are performed and reported are:
A Mann Kendall test for mean
A Mann Kendall test for variance
A W test for normality
A Von Neumann Test for Serial Correlation
Further detail on all four tests is given below:
Mann Kendall Test:
The Mann Kendall test is used to determine whether data collected over time shows a consistent increasing or decreasing trend. It is designed specifically for highly oscillating data (such as MD data), for which simple linear regression is poorly suited.
The test statistic for this test is shown to the right:
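(The equation itself appears in a poster figure and is not reproduced in this text. For reference, the standard form of the Mann Kendall statistic, written here without tie corrections over the n segment values x_1, ..., x_n, is

$$S = \sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\operatorname{sgn}\left(x_j - x_i\right), \qquad \operatorname{Var}(S) = \frac{n(n-1)(2n+5)}{18},$$

with S standardized to a Z score, $Z = (S-1)/\sqrt{\operatorname{Var}(S)}$ for $S>0$, $0$ for $S=0$, and $(S+1)/\sqrt{\operatorname{Var}(S)}$ for $S<0$, which is compared against a standard normal distribution.)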
Thus tests 1) and 2) check whether the data has stabilized and reached an equilibrated state. MD data initially fluctuates (in both mean and variance), especially towards the beginning of a simulation, because the system has not yet reached its equilibrated state and is still moving towards it.
This initial, pre-equilibrated phase needs to be discarded when analyzing MD data, as it is unstable and not representative of the system as a whole.
When this test fails (the test gives convincing evidence that the data is either increasing or decreasing over time), the suggestion is to increase the start time to cut off the non-equilibrated data.
Example from the paper, indicating the start time at which the data has equilibrated. The purpose of the Mann Kendall test is to find this point.
Example of noticeably unequilibrated data. Graph generated by the program.
W Test for Normality:
The W Test for Normality (also known as the Shapiro–Wilk test) is used to determine whether a set of data points is approximately normally distributed. The equation for the test statistic is given on the right:
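(The equation appears in a poster figure; for reference, the standard form of the Shapiro–Wilk statistic over n values is

$$W = \frac{\left(\sum_{i=1}^{n} a_i\, x_{(i)}\right)^{2}}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}},$$

where $x_{(i)}$ are the values sorted in increasing order and the coefficients $a_i$ are tabulated constants derived from the expected order statistics of a standard normal sample.)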
This is used in this project to verify that the averaged data from the coarse grain sample is normally distributed, which is essential for the next test (the von Neumann test). As long as the number of points in each segment (m) is sufficiently large, this test should pass due to the central limit theorem.
Thus if this test fails (the test finds convincing evidence that the distribution is not normal), the suggestion is simply to increase m, the number of points per segment, so that the averaged data approaches normality.
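If SciPy is available, the W test can be run directly; the snippet below is a minimal sketch (the array segment_means stands in for the coarse grain averages and is generated synthetically here), not necessarily how the program itself performs the test.

```python
import numpy as np
from scipy.stats import shapiro

# Hypothetical coarse grain averages; in practice these come from the block-averaging step.
segment_means = np.random.normal(loc=0.0, scale=1.0, size=30)

w_stat, p_value = shapiro(segment_means)
if p_value < 0.05:
    # Convincing evidence of non-normality: increase m and rebuild the coarse grain sample.
    print("W test failed; consider increasing m")
```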
Von Neumann Test for Serial Correlation:
The Von Neumann Test for Serial Correlation tests for correlation between successive data points by comparing the sum of squared differences between consecutive points to the overall variance of the data; for uncorrelated data this ratio should be close to 2. The expression for the test statistic is given on the right:
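(The expression is shown in a poster figure; for reference, the classical von Neumann ratio over the n segment values is

$$q = \frac{\tfrac{1}{n-1}\sum_{i=1}^{n-1}\left(x_{i+1} - x_i\right)^{2}}{\tfrac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}},$$

whose expected value is $2n/(n-1) \approx 2$ for uncorrelated data; values well below 2 indicate positive serial correlation and values well above 2 indicate negative serial correlation.)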
This test has a prerequisite assumption that the data is approximately normally distributed, which is why the previous test for normality was conducted. If this test fails (the test finds convincing evidence of either a positive or negative correlation), the suggestion is to increase the number of points (m) in each segment to try to smooth out variations.
Visualization of what a Von Neumann Test does.
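As an illustration of how compact this statistic is in code, the sketch below computes the ratio with NumPy; the function name and the use of the biased variance in the denominator are illustrative choices, not necessarily the program's exact implementation.

```python
import numpy as np

def von_neumann_ratio(x):
    """Ratio of the mean squared successive difference to the sample variance."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    msd = np.sum(np.diff(x) ** 2) / (n - 1)   # mean squared successive difference
    var = np.sum((x - x.mean()) ** 2) / n     # (biased) sample variance
    return msd / var                          # close to 2 when there is no serial correlation
```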
Last General Notes about Tests:
If all four tests pass, the data under the given inputs (start time, time step, and segments) is equilibrated, and it is valid to draw conclusions from it. If the user was not already using the maximum n value (the number of coarse grain segments tested), which is commonly the case, then after all four tests pass the user should increase n to add further statistical power to all of the tests, especially the von Neumann test, in order to increase the certainty of the results.
Finally, the number of segments (n) should never be below 24, as that is the threshold for the von Neumann test to have reasonable statistical power (the probability that the test correctly detects a real effect).
Example of Code/Coding Process:
On the right is an example of how the mathematical expressions of the various statistical tests were converted into Python code.
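The excerpt itself is in a poster figure; as an illustration of the same translation process, a minimal Mann Kendall implementation might look like the sketch below. The function name, the omission of tie corrections, and the 0.05 default significance level are assumptions for illustration, not the program's actual code.

```python
import numpy as np
from scipy.stats import norm

def mann_kendall(x, alpha=0.05):
    """Mann Kendall trend test (no tie correction) on the coarse grain segment values."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # S: sum of the signs of every forward pairwise difference
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0      # variance of S when there are no ties
    if s > 0:
        z = (s - 1) / np.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / np.sqrt(var_s)
    else:
        z = 0.0
    p = 2.0 * (1.0 - norm.cdf(abs(z)))            # two-sided p-value
    return z, p, p < alpha                        # True in the last slot means a significant trend (test fails)
```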
Full Sample Output from the Program:
Displayed is a full sample output from the program including the graphs generated and the inputs taken along with the outputs given.
Discussion:
First off, it is important to note some of the major limitations of this project in order to assess how successful the process was and whether it met the initial project objectives.
Limitations:
The first of these limitations is that, because the program was written in Python, it is geared towards general use and easy accessibility (which is in line with the initial project objectives). However, this also means it cannot efficiently handle massive amounts of MD data, so a program in a faster compiled language (such as C++ or Java) may be required for very large data sets.
It is also important to note that this is just one way to determine whether MD data is equilibrated, and it may work better or worse on different data sets. This is a general-purpose program, so it may not always be the optimal choice for certain types of MD data, in which case modifications should be made.
Tie back to Project Objective:
However, despite these limitations, the program created did align with the initial project objectives: it is easily accessible (written in Python, which is easily shared over the web), it can quantify statistical errors within MD data through the various statistical tests employed, and it can be used by other researchers to calibrate further experiments, since it allows them to understand when their data is valid and what changes need to be made.
Thus the project was reasonably successful, as long as one keeps in mind the limitations and cautions associated with the program.
Conclusion:
Contribution to academic conversation:
Overall, this project brings statistical testing in simulations, something that is usually kept in the background and glossed over, and thus could contain flaws, to the forefront of the academic conversation by developing a rigorous and systematic general method of quantifying and interpreting statistical results in MD data. This aids the entire process of MD simulation by providing a tool other researchers can use to study the properties of molecular systems, which can be applied to many real-world problems such as aiding drug design, helping researchers better understand biological processes such as protein folding, and many other practical applications.
Implications:
Thus this program will aid researchers working on small to medium sized molecular dynamics simulations by helping them confirm the points at which their data is equilibrated.
It also provides a basis for others to build on this program in order to better suit their own simulations and further improve statistical analysis for MD simulations.
And as stated before, all of these points can aid research into many real-world, practical problems.
Further studies:
Further directions for research based on this project include:
Creating another version of the program in a compiled language for large data sets.
Implementing further tests and methods of analyzing MD data and comparing them to each other in order to find the test best suited to the data.
Reflection:
1) Think back to the initial curiosity that sparked your inquiry. What other curiosities do you have and how has this process prepared you to explore them?
Completing the project has raised more questions in my head about how different statistical tests and methods of analyzing MD data might compare to each other, especially on varying data sets, as well as a more general curiosity about computer simulation methods as a whole. Completing this project has also given me the tools I need to explore these curiosities: it has deepened my experience in computer programming, especially within scientific computing and analysis, which lets me explore more of these problems by developing programs and algorithms and testing them, both as a way of testing hypotheses and as a way of creating a useful end product. Specifically, in the future I plan to explore some of these curiosities by writing further programs and testing them in order to see how different methods perform and what factors in the data may affect their performance.
2) How did you handle the uncertainty of the research process?
Throughout the research process, I handled uncertainty mainly in two ways: re-consulting the scholarly materials to deepen my knowledge of each of the statistical tests involved, and coding and, more importantly, testing inputs in order to gain a better understanding of the program. First, whenever I had uncertainty about any part of the statistical tests and principles involved, I would go back and consult the original works my research was based on, specifically "Statistical errors in molecular dynamics averages" by Schiferl and Wallace, doing a deep dive into any concepts I did not understand by looking into their references on the principles in question and doing my own research on the topic as well, synthesizing a variety of perspectives to better understand it. Beyond that, the process of converting my current understanding of a topic into code also greatly helped resolve uncertainties about specific statistical principles, as the reasoning behind a given test often began making more sense once I implemented it myself, since implementation pushes me to think critically at every step. Finally, testing different input values in a specific module and carefully observing the output solidified my understanding, as it let me engage with the statistical principles involved by hypothesizing what the output for a given input should be and then verifying that hypothesis through testing.
Finally, I would like to thank the following for their continuous support throughout my project:
Dr. Hai Lin for mentoring me throughout the entire process and generously supplying the initial testing data.
Mrs. Dobos for being my supervisor and providing valuable suggestions and guiding me throughout the entire process.
And lastly, all my peers who provided feedback.