Tutorial 3) Part 3: Visualizing Data Distributions

We have constructed three different histograms for our bimodal data distribution from Tutorial 2 (rmsd.dat). Generally, the larger the data set, the more bins you should include; this ensures sufficient statistical sampling across all bins. In the limit of an infinitely large data set, the bin width approaches zero and the histogram approaches the smooth probability density function. Let's see how well our simulation of 2,000 data points approximates the bimodal distribution function.

Visualizing data sets is a crucial aspect of data analysis: it allows us to gain insight into the distribution of the data and then communicate our findings. Observe the plot of our three histograms (Fig. 1). In a bimodal distribution, two well-defined peaks should be evident, corresponding to the open and closed states of our protein receptor.


Q7: Look at Figure 1. What are the advantages and disadvantages of using too many or too few histogram bins for a given data set? Which histogram best captures the true distribution function?

Figure 1. Observed bin frequencies of the root-mean-square deviation (RMSD) from a dynamics simulation of protein receptor conformation. Relative frequencies are shown for a histogram containing only five bins (dashed line). y-scaled frequencies of the 30- and 300-bin histograms are overlaid for comparison (solid lines).


Figure 1 was prepared using the following gnuplot commands..

set terminal postscript eps enhanced colour "Helvetica" 10.5

set encoding iso_8859_1

set size .42 #.38,.36

set output "rms2.hist.eps"

set xlabel "RMSD (nm)" font "Helvetica-Bold,12"

set ylabel "probability" offset 1.5 font "Helvetica-Bold,12"

set key font ",9"

set key at .38,1.14

plot 'rms2.hist300' u 1:($2*60.0/2000) w l lw 3.6 lc rgb "gray70", \

'rms2.hist30' u 1:($2*6/2000) w l lt 1 lw 3 lc rgb "black", \

'rms2.hist5' u 1:($2/2000) w l lt 2 lw 4 lc rgb "black"


Python has two popular libraries for plotting: Seaborn (advanced statistical visualization) and Matplotlib. The latter is included with most scientific Python distributions (e.g. Anaconda). To prepare a histogram with plt.hist, enter this code in a python terminal or save it as a script and run it:

import matplotlib.pyplot as plt #matplotlib doesn't work in cygwin

rmsd = []
for line in open('rmsd.dat', 'r'):             #read text file line by line..
    values = [float(s) for s in line.split()]  #split each line into its columns
    rmsd.append(values[1])                     #keep the second column (the RMSD)
rmsd.pop(0)                  #remove first unequilibrated value
plt.hist(rmsd, bins=30)      #30-bin histogram
plt.savefig("hist30.png")
plt.show()

________________________________________________________________________________________


Back at your Unix prompt, let's examine the 5-bin histogram..

cd ~/day3

cat rms2.hist5

0.19643749 201

0.25286967 328

0.30930185 72

0.36573403 934

0.42216621 465


Note how the five bins do a poor job of locating the true maxima of the density function: 0.253 nm is listed (the bin center) when the true maximum should be closer to 0.23 nm. Using too many bins, on the other hand, captures random high-frequency noise. Finding the center of a curve (a location parameter) with the 300-bin data would require applying a data-smoothing function. Smoothing functions come in a variety of forms (e.g. polynomial), each of which makes assumptions about the functional form within a smoothing window, and you would still have to arbitrarily choose the number (width) of the smoothing windows. The statistical version of this idea is kernel density estimation.
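As a minimal Python sketch of kernel density estimation (assuming rmsd.dat is the plain two-column file read in the Python example above, and that SciPy is installed; the bandwidth rule plays the same arbitrary role as the smoothing-window width):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde   #assumes SciPy is installed

rmsd = np.loadtxt('rmsd.dat')[1:, 1]   #column 2, skipping the first unequilibrated point
kde = gaussian_kde(rmsd)               #default bandwidth (Scott's rule), still an arbitrary choice
x = np.linspace(rmsd.min(), rmsd.max(), 300)
plt.hist(rmsd, bins=300, density=True, color='0.8')   #noisy 300-bin histogram
plt.plot(x, kde(x), 'k-')              #smoothed density estimate
plt.show()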


The more bins we use, the lower the count in each bin, making it hard to directly compare the resulting histograms. To get the three curves in Figure 1 to overlay, their y-values are each multiplied by a different scaling factor. A scaling parameter stretches or squeezes a graph: multiplying the y-values in rms2.hist300 by a factor of 10 relative to rms2.hist30 makes its y-scale 10 times higher, compensating for spreading the same 2,000 points over 300 vs. 30 bins.


Consider hist5 again. The bin centered at 0.25 nm was observed 328 times. The value 328 is the absolute bin count, or frequency. The "probability" that data falls in a bin is given by the relative frequency, i.e. the number of observed occurrences divided by the total number of data points: 328/2000 = 0.164 (dashed line, Fig. 1).

relative frequency = number of occurrences/total observed outcomes Equation 1

This means that data fell into the 0.25 nm bin 16.4 percent of the time.


Dividing each bin count by the total number of observations gives the relative frequency, but its magnitude still depends on the number of bins. In general, the probability density is the relative frequency divided by the bin width; if the width of each bin on the x-axis is 1, then the relative frequency plot is identical to the probability density. The total area under a histogram normalized to probability density is 1, because the sum of the probabilities of all outcomes must equal 1.
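In NumPy and Matplotlib this normalization is a single keyword; a minimal sketch, assuming the rmsd list from the plotting example above:

import numpy as np
counts, edges = np.histogram(rmsd, bins=30, density=True)   #normalize to probability density
print((counts * np.diff(edges)).sum())   #total area under the histogram: prints 1.0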


Q8: a) Open the three histogram files (rms2.hist30, etc.). Find the peak probability value for the bin nearest the two states A and B in the 5-bin, 30-bin and 300-bin histograms. Compile the relative frequencies in a table. b) Why did we need to scale the relative frequencies (column $2 in gnuplot) to compare the histograms? c) Which histogram in Figure 1 is plotted as the unscaled relative frequency?


Part 4: Statistical Thermodynamics


Recall the Gibbs equation relating the free energy of a chemical to its equilibrium constant:

ΔG° = -RT ln Keq Equation 2

Equation 2 is a macroscopic statement of thermodynamics: it concerns observable properties of matter. Statistical thermodynamics relates observable thermodynamic quantities to the probability theory of microscopic ensembles of molecules. From our histogram, we have the probability distribution for structural RMSD observed in a time dynamics simulation, which generated an ensemble of 2,000 microscopic states. The thermodynamic free energy as a function of our reaction coordinate (RMSD) becomes:

G(rmsd) = -RT ln p(rmsd) Equation 3

Or, at 310 K.. G = -0.616 ln p (in kcal/mol, using R = 1.987 × 10^-3 kcal/mol/K)


We can use awk to easily convert the probability distribution to its underlying energy landscape. Sampling the thermally accessible conformations on this energy landscape is what gave rise to the ensemble distribution observed. Remember: p = exp(-G/RT)

awk '{print $1,-0.616*log($2)}' rms2.hist30 > G_vs_rmsd

./sts G_vs_rmsd 2

# Mean Std dev Min Max n #file G_vs_rmsd column 2

-2.159906 0.7560361 -3.587 -0.853957 30


You may have noticed that in the awk command we did not divide the bin count ($2) by 2000 to obtain the relative frequency. That's okay because of the properties of logarithms: log($2/2000) = log($2) - log(2000), and log(2000) is just an additive constant that won't affect our analysis.

It turns out we only ever know the energy up to an additive constant--it is only the relative difference in energy that we are interested in. The absolute value of the energy has no meaning. The choice of zero-point energy is arbitrary, so we just choose a convenient value. In our case, we'll define the minimum energy of the lowest stable state to be zero..
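If you prefer Python to awk, here is a minimal numpy sketch of the same Boltzmann inversion (the output filename is just an example; it assumes rms2.hist30 holds bin centers and counts as shown above):

import numpy as np

RT = 0.616                                 #kcal/mol at 310 K
x, counts = np.loadtxt('rms2.hist30', unpack=True)
G = -RT * np.log(counts / 2000.)           #the 1/2000 factor only shifts G by a constant
G -= G.min()                               #define the lowest minimum as the zero of energy
np.savetxt('G_vs_rmsd.py', np.column_stack((x, G)))   #example output file
#note: bins with zero counts would give infinite G and should be dropped first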

Figure 2. The free energy is computed as a function of receptor conformation (RMSD) for the simulation trajectory using Equation 3. The energy barrier maximum defines the dividing line between the two states: 0.323 nm.


% gnuplot

gnuplot> plot 'G_vs_rmsd' u 1:($2+3.587) w l lw 2.5 lc rgb "black"

gnuplot> ^D

Notice that the two stable states at 0.23 and 0.39 nm are local free energy minima, analogous to how they were maxima in the probability function. Thermodynamic systems tend toward states of lower energy. Which state did the system prefer?


awk '$1<.3234 {print}' rms2 > lo

awk '$1>.3234 {print}' rms2 > hi

wc -l lo hi

591 lo

1409 hi

2000 total


The protein visited the high-RMSD state over the low-RMSD state by a ratio of 1409/591 = 2.4 = Keq. This makes sense because the low-RMSD state was about 1 kcal/mol higher in energy and was visited only 30% of the time. Also note that the high-RMSD state is narrower: this is typical of the closed conformation of a receptor compared to its more flexible open state.


./sts lo 1

# Mean Std dev Min Max n #file lo column 1

0.2383978 0.03179934 0.1682214 0.3215592 591

./sts hi 1

0.3873937 0.01717574 0.3236184 0.4503823 1409


In the vicinity of each energy minimum, the energy function is approximately quadratic (a parabola). This is true near any smooth local extremum. What is the probability distribution that results from a quadratic energy term? For a general reaction coordinate x it is:

p(x) = exp(-x^2 / RT), where RT is a constant. Equation 4


Equation 4 is the function for the normal distribution of statistics, also known as a bell curve or Gaussian function. The maximum in p occurs at x = 0 and the curve decays with width, or standard deviation, equal to (RT/2)^0.5. The higher the temperature, the higher the probability of visiting high energy states and the more spread out the distribution.
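You can test this Gaussian approximation against one of our stable states; a sketch assuming the file lo created above and that SciPy is installed:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm            #assumes SciPy is installed

lo = np.loadtxt('lo', usecols=0)        #RMSD values of the low-RMSD state
mu, s = lo.mean(), lo.std(ddof=1)       #compare 0.2384 and 0.0318 from sts above
x = np.linspace(lo.min(), lo.max(), 200)
plt.hist(lo, bins=30, density=True)     #empirical density of the state
plt.plot(x, norm.pdf(x, mu, s), 'k-')   #normal curve with the same mean and width
plt.show()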


The normal distribution is often observed in practice. An example is the oscillation of a harmonic spring, in which the energy is a quadratic function. There is another, more subtle reason why normal distributions are ubiquitous: a principle of statistics known as the central limit theorem. It states that, regardless of the underlying distribution, the means of randomly drawn samples themselves approach a normal distribution. The standard error of the sample mean (SEM) is s_m = s / n^0.5, where n is the sample size and s is the standard deviation of the original distribution.
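A quick numerical sketch of the central limit theorem (the uniform distribution here is just an arbitrary non-normal choice):

import numpy as np

rng = np.random.default_rng(0)
samples = rng.uniform(0, 1, size=(10000, 25))   #10,000 samples, each of size n = 25
means = samples.mean(axis=1)                    #the sample means are near-normal
print(means.std(ddof=1))   #approaches s/n**0.5 = 0.2887/5 = 0.0577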

Additivity of Variances

An application of the central limit theorem involves error propagation. Suppose you weigh the first component of some object with error s_a, then measure the second component to precision s_b. You want to know the error in your estimate of the total weight, which is the sum of the two measurements. If the random error in each measurement is normally distributed, and the errors in the two measurements are independent of each other, it turns out that the error in the sum is also normal and obeys a simple relationship:

s_total^2 = s_a^2 + s_b^2 Equation 5: Additivity of variances

If s_a = 3 and s_b = 4, then s_total = 5, which is smaller than 7: independent errors partially cancel each other out.
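A quick numerical check of Equation 5, using made-up normal errors of width 3 and 4:

import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(0, 3, 100000)   #measurement error s_a = 3
b = rng.normal(0, 4, 100000)   #independent error s_b = 4
print((a + b).std(ddof=1))     #approaches (3**2 + 4**2)**0.5 = 5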


Q10: Using the analysis below, test the additivity of variances. Since lo and hi give two approximately normally distributed variables, define a third variable as their sum and test the theorem. To what percent do the two results for the standard deviation agree?


There are only 591 values in file lo, so extract 591 entries from hi:

awk 'NR % 2 == 1' hi | head -n 591 > hi.2 awk gets every 2nd line from line 1; head keeps the first 591 entries


The unix paste command combines two columns into one file (separated by space):

paste -d ' ' lo hi.2 > hilo

awk '{print $1+$2}' hilo > sumhilo


./sts lo 1

# Mean Std dev Min Max n #file lo column 1

0.2383978 0.03179934 0.1682214 0.3215592 591

./sts hi 1

0.3873937 0.01717574 0.3236184 0.4503823 1409

./sts sumhilo 1

0.6261478 0.03568436 0.519937 0.744283 591


python

>>> (0.03179934**2+0.01717574**2)**0.5

0.03614144530844333


The standard deviation of the sum appears well approximated by the square root of the sum of the two variances. Hit ^d when you're ready to exit the python interpreter or type:

>>> exit()


To save a record of the Unix commands you entered:

history > ~/day3/history.out

Check the size of your folders to see if we need to clean anything up:

du -sh ~/day? or du -sh *

find will search the folders for files larger than 1 MB:

find ~/day* -size +1M

Try to keep your data to within a few Megabytes, because we will be copying it to my server next. Delete unneeded files with the remove command (rm file).


Part 5. Unix Conclusion: Remote Login (ssh). Note: UNE's security firewall requires a wired ethernet connection.


The next step is to use your terminal to remotely access my computer, called the host or server. Your user terminal that logs on to the server is called the client. This is termed remote login. Because you have been granted access to a private computer, your connection and any information you transmit will be encrypted using the Secure Shell (ssh) program.


Data transmitted over a secure network is encoded by the client into an indecipherable binary stream. The receiving computer needs a key to unlock, or decode, the data back into usable information (e.g. ASCII files). The host computer has the necessary 128-bit encryption key, a binary string, which is combined with each 128-bit chunk of data to decode it. There are 2^128 possible such keys, making it virtually impossible to crack the code by brute force. You can also encrypt files on your disk (in Vim, for example, enter :X in command mode and save the file).
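SSH's actual ciphers are more elaborate, but a toy Python sketch of the core idea (combining each chunk of data with a shared secret key, here by XOR) may help; this is purely illustrative, not real cryptography:

key = 0b10110010                         #toy 8-bit shared key (real keys are 128+ bits)
message = [ord(c) for c in 'hello']      #data as bytes
cipher = [m ^ key for m in message]      #client encodes: XOR each byte with the key
decoded = [c ^ key for c in cipher]      #host decodes: XOR again with the same key
print(''.join(chr(d) for d in decoded))  #prints: hello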


To log into a machine you need a user account and password, and the server needs an Internet Protocol (IP) address on the network. For a server, the address is statically assigned by an administrator during setup. For a laptop that you turn on at different times, a dynamic IP address is assigned when you log on to the network, from whatever addresses are still available. IP addresses are denoted using four 8-bit integers (between 0 and 255) separated by periods. There are 256^4 ≈ 4.3 billion possible addresses because only 4 bytes (32 bits) are used to store the binary address. My Linux server on campus is connected to a wired ethernet port and has the static IP address: 10.2.4.187
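To see where 256^4 comes from, here is a short sketch that packs the four bytes of garcia's address into a single 32-bit integer:

octets = [10, 2, 4, 187]   #the static IP 10.2.4.187
n = 0
for o in octets:
    n = n * 256 + o        #each octet is one byte (0-255)
print(n)                   #167904443, one of 256**4 (about 4.3 billion) possible values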


You're familiar with sending email to a user on a network: rhills@une.edu. Here 'une' is a subdomain under the .edu internet domain. Email is not hosted on one particular computer, but for logging in remotely you will need a specific computer. At a school with a dedicated chemistry server, I would enter ssh rhills@chem.une.edu

In this case chem.une.edu refers to a specific host, and the Domain Name System (DNS) translates the domain name into the actual IP address.


To log into my Linux server from a Unix terminal, you will need a campus ethernet connection. You may need to turn off your wifi. Load an Internet page to confirm that the network works. To log on using Secure Shell, you must enter the username (if it is different than the username on your current computer), followed by the IP address or hostname (similar to an email address):

ssh student@garcia.une.edu user student logs on to host computer garcia

ssh student@10.2.4.187 our network may require you to enter my static IP address

When you connect to a computer for the first time it will ask you if you are sure, type:

yes enter


You will then need to type the password and hit enter within a few seconds before the connection times out. If you don't get a response after a few seconds or you mistyped the ssh command, hit control C (^c) to cancel and return to the Unix prompt and try again.


You should see a different Unix prompt when you are logged on to my workstation, named garcia:

garcia:~ %


You will also have access to shortcuts on my machine. I have set up l as an alias for 'ls -lFGo'. Try it out:

l enter

total 12

drwxr-xr-x 2 student 4096 Apr 5 13:01 2022/

drwxr-xr-x 2 student 4096 Apr 5 13:02 bin/

drwxr-xr-x 2 student 4096 Apr 5 13:03 diploma/


We'll be uploading your Unix files in your own directory inside ~/2022 on my Linux server.

cd 2022

mkdir yourlastname

ls

Now garcia is actually set up as a Linux computer cluster. It has several compute nodes (0-5) that users can run large calculations on using a job queueing system. Aggregating computing power from multiple servers is known as high performance computing. To see a list of nodes and/or running jobs:

qstat -f

queuename qtype resv/used/tot. load_avg arch states

----------------------------------------------------------------------

all.q@compute-0-0.local BIP 0/0/64 0.01 lx-amd64

----------------------------------------------------------------------

all.q@compute-0-1.local BIP 0/0/64 0.01 lx-amd64

----------------------------------------------------------------------

all.q@compute-0-2.local BIP 0/0/64 0.01 lx-amd64

----------------------------------------------------------------------

all.q@compute-0-4.local BIP 0/96/96 96.04 lx-amd64

67 0.55500 job.sge rhills r 04/05/2021 17:01:18 96

----------------------------------------------------------------------

all.q@compute-0-5.local BIP 0/0/96 -NA- lx-amd64 au


The head node (server garcia) and compute nodes 4-5 each have two AMD Epyc 7401 processors (24 cores/48 threads each, 96 threads per node). Nodes 0-2 have dual Epyc 7281 chips (64 threads per node). While submitting a job gives you access to the cores on one node, the data generated is stored back in your user home directory. Home directories are stored on a redundant RAID disk array hosted by the head node. The au state for compute-0-5 means it is offline. Job #67 (job.sge) is currently running on node 4, using all 96 threads. The other nodes are currently idle (no CPU load).


Now that you have a directory with your name, log out of my machine:

exit

Confirm that you are back on your laptop:

ls -lrt

We needed to log out because the connection is one way: my machine is set up as a server but your laptop is not.


Now we will upload your data to the server using remote copy. Remote copy has similar syntax to cp but you need to specify the username and hostname for the destination (2nd space-separated command argument):


Mac: rsync -avP ~/day? student@garcia:~/2022/yourlastname/

PC: scp -r ~/day? student@garcia.une.edu:~/2022/yourlastname/


Enter your password before the connection times out and you should see a list of the files and their transfer speed. The ? wildcard matches any single character, so the command copies your day1, day2 and day3 directories. The rsync command is a fast way to back up files because it mirrors the source directory to the destination, updating only the files that have changed each time.


Lastly, confirm the files are on the host. ssh can execute a command for you without logging in:

ssh student@garcia ls -l 2022/yourlastname

(You shouldn't have to type une.edu since you are logged on the network from inside that domain.)


Download a Certificate of Completion to your laptop and clean up all the files you've created..


PC: cd /cygdrive/c/Users tab tab

The tab key should give you a list of options. Type the rest of the correct path to your Windows home directory:

cd /cygdrive/c/Users/username hit enter

mkdir unix

cd unix move into the new directory

pwd confirm that you're inside the newly created unix directory

cp -a ~/* . copy your data (day1, day2, day3) from Cygwin to the new unix folder (.) inside your Windows folder (otherwise you will lose your work when you uninstall Cygwin!)


Mac: mkdir ~/unix create a folder for all your work

cd !$ the variable !$ is a convenient shortcut for the argument of the last command

pwd confirm that you're inside the newly created unix directory

mv ~/day? ~/unix move all your work inside a single master unix folder

ls ~/unix


PC/Mac:

ssh student@garcia ls diploma check the punctuation of your name

scp student@garcia:diploma/firstname.pdf . securely copies the pdf file to your current directory


Mac: open firstname.pdf


Conclusion

This concludes the Unix programming portion of our workshop. Great job! Now that you know how to create your own functions, you can program virtually any calculation for a large data set. Modern languages do have many built-in functions to assist you in rapid data analysis. One essential Python library for numerical analysis is NumPy:


python

import numpy as np

np.std([3,4,5],ddof=1) standard deviation of a list of sample data

...and so on for np.var(), np.mean(), np.median(), etc.

See also: https://numpy.org/doc/stable/reference/generated/numpy.histogram.html
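For example, the 30-bin histogram from this tutorial could be regenerated in one line (assuming rmsd is the list read earlier):

counts, edges = np.histogram(rmsd, bins=30)   #absolute bin counts and 31 bin edges
centers = (edges[:-1] + edges[1:]) / 2        #bin centers, like column 1 of rms2.hist30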

Proceed to: Tutorial 4