Notice that the two stable states at 2.3 and 3.9 angstroms are local free energy minima, analogous to how they were maxima in the probability function. Thermodynamic systems tend toward states of lower energy. Which state did the system prefer?
awk '$1<.3234 {print}' rms2 > lo
awk '$1>.3234 {print}' rms2 > hi
wc -l lo hi
591 lo
1409 hi
2000 total
The protein preferred the high RMSD state by a ratio of 1409/591 = 2.4 = Keq. This makes sense because the low RMSD state was 1 kcal higher in energy and was visited only 30% of the time. Also note how the high RMSD state is narrower: this is typical of the closed conformation of a receptor compared to its more flexible open state.
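As a quick check of the arithmetic, the same numbers can be reproduced in Python from the wc line counts (a minimal sketch; we will use the Python interpreter again below):
n_lo, n_hi = 591, 1409                 # line counts from wc -l above
Keq = n_hi / n_lo                      # equilibrium constant from the population ratio
frac_lo = n_lo / (n_lo + n_hi)         # fraction of time spent in the low RMSD state
print(round(Keq, 1), round(frac_lo, 2))    # 2.4 and 0.3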
./sts lo 1
# Mean Std dev Min Max n #file lo column 1
0.2383978 0.03179934 0.1682214 0.3215592 591
./sts hi 1
0.3873937 0.01717574 0.3236184 0.4503823 1409
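The ./sts script used above reports the mean, standard deviation, min, max, and count. The same summary can be reproduced with NumPy (a sketch, assuming lo is the one-column file created above):
import numpy as np
data = np.loadtxt("lo")                     # read the single column of RMSD values
print(data.mean(), data.std(ddof=1),        # mean and sample standard deviation
      data.min(), data.max(), data.size)    # min, max, and number of entries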
In the vicinity of each energy minimum, the energy function is well approximated by a parabola (a quadratic function); the same holds as you approach any smooth local extremum. What probability distribution results from a quadratic energy term? For a general reaction coordinate x it is:
p(x) = exp(-x^2 / RT), where RT is constant at a given temperature.        Equation 4
Equation 4 is the function for the normal distribution of statistics, also known as a bell curve or Gaussian function. The maximum in p occurs at x = 0 and the curve decays with width, or standard deviation, equal to (RT/2)^0.5. The higher the temperature, the higher the probability of visiting high energy states and the more spread out the distribution.
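You can check the width relation numerically by evaluating p(x) from Equation 4 on a grid and computing its standard deviation (a sketch; the value RT = 0.593 kcal/mol, roughly 300 K, is an assumption chosen only for illustration):
import numpy as np
RT = 0.593                                   # kcal/mol near room temperature (illustrative value)
x = np.linspace(-5, 5, 100001)               # a grid of reaction coordinate values
p = np.exp(-x**2 / RT)                       # Equation 4, unnormalized
width = ((x**2 * p).sum() / p.sum())**0.5    # standard deviation of the distribution
print(width, (RT/2)**0.5)                    # both ~0.545: the width equals (RT/2)^0.5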
The normal distribution is often observed in everyday practice. An example is the oscillation of a harmonic spring, in which the energy is a quadratic function. There is another, more subtle reason why normal distributions are ubiquitous: a principle of statistics known as the central limit theorem. It states that, regardless of the underlying distribution, the means of randomly drawn samples themselves approach a normal distribution as the sample size grows. The standard error of the sample mean (SEM) is s_m = s / n^0.5, where n is the sample size and s is the standard deviation of the original distribution.
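Here is a minimal NumPy sketch of the central limit theorem: draw many samples from a decidedly non-normal (uniform) distribution and compare the spread of their means with s / n^0.5 (the sample size n = 100 and the number of samples are arbitrary choices):
import numpy as np
rng = np.random.default_rng(0)
n = 100                                        # size of each sample
samples = rng.uniform(0, 1, size=(10000, n))   # 10000 samples from a flat (non-normal) distribution
means = samples.mean(axis=1)                   # the mean of each sample
s = samples.std(ddof=1)                        # std dev of the underlying distribution (~0.289)
print(means.std(ddof=1), s / n**0.5)           # the sample means are spread by ~s/sqrt(n)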
An application of the central limit theorem involves error propagation. Suppose you weigh the first component of some object with error s_a. Then you measure the second component of the object to precision s_b. You want to know the error in your estimate of the total weight of the object, which is equal to the sum of the two measurements. If we assume the random errors in the two measurements are normally distributed and independent of each other, it turns out that the error in the sum is also normal and obeys a simple relationship (in the limit of large sample size):
s_total^2 = s_a^2 + s_b^2        Equation 5: Additivity of variances
If s_a = 3 and s_b = 4 then s_total = 5, which is smaller than 7: the errors partially cancel each other out.
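A quick numerical check of Equation 5 with this 3-4-5 example, simulating two independent normally distributed errors (the sample size is arbitrary):
import numpy as np
rng = np.random.default_rng(0)
a = rng.normal(0, 3, size=100000)      # measurement error with s_a = 3
b = rng.normal(0, 4, size=100000)      # independent measurement error with s_b = 4
print((a + b).std(ddof=1))             # ~5, not 7: the errors partially cancel
print((3**2 + 4**2)**0.5)              # 5.0 predicted by Equation 5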
Q9: Using the analysis below, test and explain the additivity of variances. Since lo and hi give two approximately normally distributed variables, define a third fictitious variable as their sum, and test the theorem. By what percent does additivity of variances overestimate the standard deviation of the sample sum in this case? (1.XX %)
There are only 591 values in file lo, so extract 591 entries from hi:
awk 'NR % 2 == 1' hi | head -n 591 > hi.2 awk gets every 2nd line from line 1; head keeps the first 591 entries
The unix paste command combines two columns into one file (separated by space):
paste -d ' ' lo hi.2 > hilo
awk '{print $1+$2}' hilo > sumhilo
./sts lo 1
# Mean Std dev Min Max n #file lo column 1
0.2383978 0.03179934 0.1682214 0.3215592 591
./sts hi 1
0.3873937 0.01717574 0.3236184 0.4503823 1409
./sts sumhilo 1
0.6261478 0.03568436 0.519937 0.744283 591
python
>>> (0.03179934**2+0.01717574**2)**0.5
0.03614144530844333
The standard deviation of the sum is indeed well approximated by the square root of the sum of the two variances. Hit ^d when you're ready to exit the python interpreter or type:
>>> exit()
To save a record of the Unix commands you entered:
history > ~/day3/history.out
Check the size of your folders to see if we need to clean anything up:
du -sh ~/day? or du -sh *
The find command will search the folders for files larger than 1 MB:
find ~/day* -size +1M
Try to keep your data to within a few megabytes, because we will be copying it to my server next. Delete unneeded files with the remove command (rm file).
Part 5. Unix Conclusion: Remote login (ssh)
UNE's security firewall requires a wired ethernet connection.
The next step is to use your terminal to remotely access my computer, called the host or server. Your user terminal that logs on to the server is called the client. This is termed remote login. Because you have been granted access to a private computer, your connection and any information you transmit will be encrypted using the Secure Shell (ssh) program.
Data being transmitted over a secure network is encoded into an indecipherable binary code by the client. The receiving computer needs a key to unlock, or decode, the data back into usable information (e.g. ASCII files). The host computer will have the necessary 128-bit encryption key, a binary string, which will be added to each 128-bit chunk of data to decode it. There are 2^128 possible such keys, making it virtually impossible to crack the code. You can also encrypt files on your disk (in Vim, for example, enter :X in command mode and save the file).
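As a toy illustration only (ssh actually negotiates modern block ciphers such as AES; this sketch just mimics the idea of combining a shared key with a 128-bit chunk), applying the same key to the data twice restores the original bytes:
key  = 0x0F1E2D3C4B5A69788796A5B4C3D2E1F0             # a made-up 128-bit key shared by client and host
data = int.from_bytes(b"hello, remote lo", "big")     # one 16-byte (128-bit) chunk of a message
cipher  = data ^ key                                  # client combines the key with the data to encrypt
decoded = cipher ^ key                                # host applies the same key to decode
print(decoded.to_bytes(16, "big"))                    # b'hello, remote lo' -- the chunk is recovered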
To log into a machine you need to have a user account and password, and the server needs to have an Internet Protocol (IP) address on the network. For a server, the number is statically assigned by an administrator during setup. For a laptop that you turn on at different times, a dynamic IP address is assigned when you log on to the network, drawn from whatever addresses are still available. IP addresses are denoted using four 8-bit integers (between 0 and 255) separated by periods. There are 256^4 ≈ 4 billion possible IP addresses because only 4 bytes (32 bits) are used to store the binary address. My Linux server on campus is connected to a wired ethernet port and has the static IP address: 10.2.4.187
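The four numbers are just the four bytes of one 32-bit integer. A short Python sketch of the conversion, using the server's address from above:
a, b, c, d = (int(n) for n in "10.2.4.187".split("."))   # the four 8-bit numbers
address = ((a * 256 + b) * 256 + c) * 256 + d            # packed into a single 32-bit integer
print(address)                                           # 167904443
print(256**4)                                            # 4294967296, about 4 billion possible addresses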
You're familiar with sending email to a user on a network: rhills@une.edu. Here 'une' is a hosted subdomain of the .edu internet domain. Email is not hosted on a particular computer, but for logging on remotely you will need a specific computer. At some schools there may be a chemistry server, so I would enter ssh rhills@chem.une.edu
In this case chem.une.edu refers to a specific host, and the Domain Name System (DNS) translates that hostname into the actual IP address.
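You can watch that translation happen from Python with the standard socket library (a sketch; chem.une.edu is just the hypothetical example from the text and will only resolve if such a host actually exists on your network):
import socket
print(socket.gethostbyname("chem.une.edu"))   # the same hostname-to-IP lookup that ssh performs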
To log into my Linux server from a Unix terminal, you will need a campus ethernet connection. You may need to turn off your wifi. Load an Internet page to confirm that the network works. To log on using Secure Shell, you must enter the username (if it is different than the username on your current computer), followed by the IP address or hostname (similar to an email address):
ssh student@garcia.une.edu user student logs on to host computer garcia
ssh student@10.2.4.187 our network may require you to enter my static IP address
When you connect to a computer for the first time it will ask you if you are sure, type:
yes enter
You will then need to type the password and hit enter within a few seconds before the connection times out. If you don't get a response after a few seconds or you mistyped the ssh command, hit control C (^c) to cancel and return to the Unix prompt and try again.
You should see a different Unix prompt when you are logged on to my workstation, named garcia:
garcia:~ %
You will also have access to shortcuts on my machine. I have set up l as an alias for 'ls -lFGo'. Try it out:
l enter
total 12
drwxr-xr-x 2 student 4096 Apr 5 13:01 2022/
drwxr-xr-x 2 student 4096 Apr 5 13:02 bin/
drwxr-xr-x 2 student 4096 Apr 5 13:03 diploma/
We'll be uploading your Unix files in your own directory inside ~/2022 on my Linux server.
cd 2022
mkdir yourlastname
ls
Now garcia is actually set up as a Linux computer cluster. It has several compute nodes (0-5) that users can run large calculations on using a job queueing system. Aggregating computing power from multiple servers is known as high performance computing. To see a list of nodes and/or running jobs:
qstat -f
queuename qtype resv/used/tot. load_avg arch states
----------------------------------------------------------------------
all.q@compute-0-0.local BIP 0/0/64 0.01 lx-amd64
----------------------------------------------------------------------
all.q@compute-0-1.local BIP 0/0/64 0.01 lx-amd64
----------------------------------------------------------------------
all.q@compute-0-2.local BIP 0/0/64 0.01 lx-amd64
----------------------------------------------------------------------
all.q@compute-0-4.local BIP 0/96/96 96.04 lx-amd64
67 0.55500 job.sge rhills r 04/05/2021 17:01:18 96
----------------------------------------------------------------------
all.q@compute-0-5.local BIP 0/0/96 -NA- lx-amd64 au
The head node (server garcia) and compute nodes 4-5 each have two AMD Epyc 7401 processors (24 cores/48 threads each, giving 96 threads per node). Nodes 0-2 have dual Epyc 7281 chips (64 threads total). While submitting a job gives you access to the cores on one node, the data generated is stored remotely in your user home directory. Home directories are stored on a redundant RAID disk array hosted by the head node. The au (alarm, unreachable) state for compute-0-5 means it is offline. Job #67 (job.sge) is currently running on node 4, using all 96 threads. The other nodes are currently idle (no CPU load).
Now that you have a directory with your name, log out of my machine:
exit
Confirm that you are back on your laptop:
ls -lrt
We needed to log out because the connection is one way: my machine is set up as a server but your laptop is not.
Now we will upload your data to the server using remote copy. Remote copy has similar syntax to cp but you need to specify the username and hostname for the destination (2nd space-separated command argument):
Mac: rsync -avP ~/day? student@garcia:~/2022/yourlastname/
PC: scp -r ~/day? student@garcia.une.edu:~/2022/yourlastname/
Enter your password before it times out and you should see a list of the files and their transfer speeds. The ? wildcard matches any single character, so the command copies your day1, day2 and day3 directories. The rsync command is a fast way to back up files because it mirrors the source to the destination directory, each time updating only the files that have changed.
Lastly, confirm the files are on the host. ssh can execute a command for you without logging in:
ssh student@garcia ls -l 2022/yourlastname
(You shouldn't have to type une.edu since you are logged on the network from inside that domain.)
Download a Certificate of Completion to your laptop and clean up all the files you've created.
PC: cd /cygdrive/c/Users tab tab
The tab key should give you a list of options. Type the rest of the correct path to your Windows home directory:
cd /cygdrive/c/Users/username hit enter
mkdir unix
cd unix
pwd    confirm that you're inside the newly created unix directory
cp -a ~/* . copy your data (day1, day2, day3) from Cygwin to the new unix folder (.) inside your Windows folder (otherwise you will lose your work when you uninstall Cygwin!)
Mac: mkdir ~/unix create a folder for all your work
cd !$    !$ is a convenient shortcut for the last argument of the previous command
pwd confirm that you're inside the newly created unix directory
mv ~/day? ~/unix move all your work inside a single master unix folder
ls ~/unix
PC/Mac:
ssh student@garcia ls diploma check the punctuation of your name
scp student@garcia:diploma/firstname.pdf .    securely copies the pdf file to your current directory
Mac: open firstname.pdf
# This concludes the Unix programming portion of our workshop. Great Job! Now that you know how to create your own function, you can program virtually any calculation for a large data set. Modern languages do have many built-in functions to assist you in rapid data analysis. One essential Python library for numerical analysis is NumPy:
python
import numpy as np
np.std([3,4,5],ddof=1)    sample standard deviation of a list of data (ddof=1 divides by n-1)
...and so on for np.var(), np.mean(), np.median(), etc.
See also: https://numpy.org/doc/stable/reference/generated/numpy.histogram.html
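For example, np.histogram can rebuild the kind of probability distribution discussed at the start of this section (a sketch using the lo file from above; the number of bins is arbitrary):
import numpy as np
data = np.loadtxt("lo")                        # the low RMSD values extracted earlier
counts, edges = np.histogram(data, bins=20)    # counts per bin and the bin boundaries
print(counts / counts.sum())                   # normalized to a probability per bin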