Tutorial 2) Part 4: Awk Programming

Awk is a data-driven scripting language consisting of a set of actions taken on streams of textual data, either run directly on files or as part of a pipeline. A standard feature of Unix-like operating systems, awk is used for text processing and data extraction.


Download the sample Protein Data Bank file: Files: pdb.txt


cd ~/day2

PC: cp /cygdrive/c/Users/username/Downloads/pdb.txt ~/day2

Mac: cp ~/Downloads/pdb.txt ~/day2

ls -lrt confirm the file was downloaded


If the download didn't work copy-and-paste the following commands into a Unix terminal:

# Atom# Atom Residue Chain Resid x y z

echo "ATOM 1 N MET A 28 11.579 13.109 10.857" > pdb.txt

echo "ATOM 2 CA MET A 28 10.243 12.524 10.957" >> pdb.txt

echo "ATOM 3 C MET A 28 9.813 12.069 12.377" >> pdb.txt

echo "ATOM 4 O MET A 28 8.675 12.379 12.739" >> pdb.txt

echo "ATOM 5 CB MET A 28 10.010 11.433 9.898" >> pdb.txt

echo "ATOM 6 CG MET A 28 8.549 11.290 9.496" >> pdb.txt

echo "ATOM 7 SD MET A 28 8.299 10.450 7.910" >> pdb.txt

echo "ATOM 8 CE MET A 28 8.650 11.788 6.773" >> pdb.txt

echo "ATOM 9 N PRO A 29 10.637 11.357 13.207" >> pdb.txt

echo "ATOM 10 CA PRO A 29 10.152 10.982 14.554" >> pdb.txt

echo "ATOM 11 C PRO A 29 10.140 12.135 15.573" >> pdb.txt

echo "ATOM 12 O PRO A 29 9.493 12.012 16.618" >> pdb.txt

echo "ATOM 13 CB PRO A 29 11.096 9.851 14.971" >> pdb.txt

echo "ATOM 14 CG PRO A 29 12.363 10.136 14.256" >> pdb.txt

echo "ATOM 15 CD PRO A 29 12.018 10.869 12.983" >> pdb.txt


Note the a real protein will have >1,000 atoms, each with an xyz location in the structure file.. Let's demonstrate how Awk can extract columns of data for subsequent processing:

awk '{print $7,$8,$9}' pdb.txt

Awk is smart enough to group the variable data types into individual columns based on the existence of whitespace between the values on a given line. The whitespace can be a combination of one or more blank spaces and tabs (\t). In the old days of Fortran, your program would have to specify how many characters wide the string variable was, how long the integer variable was, and in which character locations it could be found!

Try adding some whitespace in pdb.txt and verify that awk interprets the columns correctly if they are not all lined up. If you delete the 9th entry on one line it should just omit it, keeping all the entries in register.


Like the pattern searching functions in grep and Vi, awk can return just the lines matching a search string (/ /) or regular expression..

awk '/CA/ {print "CAatom:",$2}' pdb.txt extracts atom number for alpha carbons

awk '! /C/ {print}' pdb.txt (! extracts lines without string "C")

awk '/C[AB]/ {print $3,$2}' pdb.txt (match any character enclosed in [ ] )


Grep of course has simpler syntax but is limited in function:

grep CA pdb.txt

Awk can use logical operators, performing an if-then conditional on each line ($0 represents entire line):

awk '{if ($7 < 10) print $0}' pdb.txt


Print every 3rd line from line 2 (lines 2, 5, 8..):

awk 'NR % 3 == 2' pdb.txt

Extract lines 3 to 6:

awk 'NR==3,NR==6 {print NR,$0}' pdb.txt

Print line (record) numbers/number of fields:

awk '{print NR,$0,NF} END {print "lines:",NR}' pdb.txt


Complex math can also be performed in awk. In scientific applications we'll be taking natural logarithms (base e). In programming languages natural log is 'log' in contrast to base 10 log which is log.10.

awk '{print -0.616*log($7)}' pdb.txt > pdb.out

When taking logs in programs it is important not to take the log of zero, which is mathematically undefined. Here, we pipe an empty string to awk just to test its natural log function:

echo "" | awk '{print log(2.718)}'

echo "" | awk '{print log(0)}'

awk 'BEGIN {print log(-1)}'

awk 'BEGIN {print sqrt(3.99)}'


Awk is generalizable scripting language, allowing for real programming, even without a data file..

awk 'BEGIN {initializations}

/search pattern 1/ {program actions}

/search pattern 2/ {program actions}

END {final actions}' <file>


awk 'BEGIN {for(i=1;i<=6;i++) print "square of",i,"is",i*i}'


Let's write a program, contained in the file pd.awk, to compute the molecular mass of our protein:

BEGIN {print "Calculating mass of peptide.."}

$1~/ATOM/ && $3~/^C/ {mass += 12.011; print "atom",$2,"carbon"}

$1~/ATOM/ && $3~/^N/ {mass += 14.007; print "atom",$2,"nitrogen"}

$1~/ATOM/ && $3~/^O/ {mass += 15.999; print "atom",$2,"oxygen"}

$1~/ATOM/ && $3~/^S/ {mass += 32.06; print "atom",$2,"sulfur"}

END {print " Molecular mass =",mass/1000,"kilodaltons (kDa)"}


pd.awk looks for entries that have both ATOM in field 1 AND (&&) have a field 3 starting with (^) a C, N, or O character. To execute our script file, use awk -f at the Unix prompt:

awk -f pd.awk pdb.txt


Q7: Use pd.awk to determine the mass of the dipeptide in pdb.txt in Daltons. Find the molecular weight of a real protein in kilodaltons: Files: 3ink.pdb. Next, modify the pd.awk program to print out the total number of carbon atoms contained in the molecule. Show your code and your results.


_________________________________________________________________________

Part 5: Application to Molecular Modeling

A useful formula in molecular/drug modeling applications is:

dG o = -RT ln Keq Equation 1

The Gibbs equation relates the free energy change of a chemical reaction to the equilibrium constant Keq, which is the ratio of [products]/[reactants]. R is the ideal gas constant and T is the thermodynamic temperature in Kelvin. At 37 Celsius, or 310 K, RT = 0.616 kcal/mol, which is a unit of energy. It turns out that 0.6 kcal is a measure of the thermal energy due to random fluctuations at room temperature. In other words, a good drug should bind its target with energy several times thermal energy (5-10 kcal). Because of the log relationship, for each kcal in stabilization (binding energy), a drug is ~5 times more likely to be bound to its target. In other words, the Keq in Equation 1 can be thought of as a probability.


In real applications we deal with a general process from state A to state B:

A <=> B, Keq =[B]eq /[A]eq Equation 2

Instead of arbitrarily saying the system is simply in either of one of the two states, we need to define a reaction or progress coordinate specifying how far the system has proceeded along the continuum from state A to state B. In chemistry, this is called the instantaneous reaction quotient: Q = [B] / [A] for a system not at equilibrium. In computational applications instead of Keq we compute the probability p of a given value of the generic progress coordinate r : p (r) for a continuous transition between bound and unbound drug, for example.


Solving equation 1 for p: dG = -RT ln p, we get

p(r) = exp[-dG(r)/RT] Equation 3

In programming 'exp' denotes e = 2.718.. raised to some power:

awk 'BEGIN {print exp(1)}'

awk 'BEGIN {print exp(2)}'

Alternatively we can approximate e2 using a few decimal places..

echo "" | awk '{print 2.718**2}' (** denotes exponent)


Note from Equation 3 that the higher the energy of a state of the system, the lower its probability. This is a thermodynamic principle: systems tend to minimize their free energy. This stems from the second law of thermodynamics, that systems tend to maximize their entropy (disorder).


Q8: For the reaction of some chemical species A into B, what percent of the system is in state B when dG = +1 kcal, 2 kcal, 3 kcal or 4 kcal (assume equilibrium)? If you use the following command, do your results agree with the factor of 5 relationship?

awk 'BEGIN {print exp(-1/0.616)}'


Let's look at some real data generated from a molecular dynamics simulation by the Hills Lab. Download: Files: rmsd.dat. We will continue working with this data in Tutorial 3..

cd ~/day2

PC: cp /cygdrive/c/Users/username/Downloads/rmsd.dat ~/day2

Mac: cp ~/Downloads/rmsd.dat ~/day2


ls -lrt rmsd.dat contains 53828 bytes (54 KB)

wc -l * the word count command lists the number of lines in each file

2001 rmsd.dat the simulation consists of 2000 time steps (50 ps each in column 1)

head rmsd.dat header command shows the first 10 lines in the file

tail rmsd.dat tail shows the last 10 lines; column 2 is RMSD in units of nanometers


The data file contains two columns of continuous XY data. A useful Unix program for plotting scientific data is gnuplot. Gnuplot doesn't work well within cygwin, so I produced the files for you.


The following rmsd.gpl script contains the commands for input into gnuplot:

set terminal png

set output "rmsd.png" #plot raw data:

plot '../rmsd.dat'


set encoding iso_8859_1

set terminal postscript enhanced "Helvetica" 12.0

set size 0.5

set output "rmsd.eps" #publication quality plot:

set xlabel "time (ns)" offset .6 font "Helvetica-Bold,14"

set ylabel "RMSD ({\305})" offset .4 font "Helvetica-Bold,14"

set ytics 1

set mxtics 2

set yrange [0:5]

plot '../rmsd.dat' using ($1/1000):($2*10) w lines linecolor rgb "black" linewidth 2 notitle


Gnuplot is its own command interpreter. You can enter the commands in real time at the gnuplot> prompt, followed by ^D to quit. Or the above .gpl script can be run from the Unix shell: % gnuplot rmsd.gpl

Figure 1. Receptor conformation was measured versus simulation time using the protein's instantaneous root-mean-square deviation (RMSD) relative to that of the starting open structural conformation. The receptor quickly adopts a closed conformation when the small molecule ligand/drug is bound (3.9 angstrom RMSD), and then undergoes a few more spontaneous transitions between the two states. Individual data points (+) are plotted for the PNG image (Left: nanometers vs. picoseconds). A single line is traced in the EPS file (Right).

______________________________________________________

Our gnuplot script plotted the data in two different formats. Left: rmsd.png is a png graphics file format.. a bitmap image whose text will appear pixelated if you try to zoom in. Right: rmsd.eps is a publication quality Encapsulated PostScript file. Like the PDF file format, EPS is capable of storing diagrams and text as vector graphics files. The text will look perfect no matter how much you zoom in/scale up. Trying zooming in by opening the file here: rmsd.eps. Also note in the EPS file how we scaled the x- and y-axes by a multiplicative factor ($1/1000, $2*10) to make the units easier to read.


What are we looking at in Figure 1? This is time series data. The x-axis consists of equally spaced time increments, or snapshots in a molecular dynamics simulation. The y-axis records some reaction coordinate. The data comes from a computer simulation of a protein receptor. We are monitoring how the protein moves (structural conformation changes) in time to bind a drug molecule. Looking at the data, you'll see that the protein switches back and forth between two average values of the progress variable (bimodal), which correspond to two different conformational states of the receptor.


The x series data starts from time zero and is incremented by 50 ps a total of 2000 times, representing 100 nanoseconds of real time simulation. One picosecond is 10-12 seconds; 1 ns = 10-9 s, or 1 billionth of a second.


Q9: What is the real time duration of a simulation consisting of one thousand 1 ps timepoints? If this simulation took 96 hours of computer time on one CPU thread, how long would you expect it to take on a quad core (4 physical processors) running a total of 8 virtual threads? Assume the code scales perfectly in parallel calculations.


What are the y-values in Figure 1? Return to our sample PDB files. A protein structure, or conformation, is defined by the positions, or coordinates of all of its atoms in 3-dimensions. That is, the X, Y, and Z coordinates of each atom in the protein are listed in

the pdb file. (Look at your favorite protein in the DataBank.) By convention the position is listed in units of Angstroms (1 angstrom = 10-10 meters = 10 nm).


We need a single progress variable that distinguishes one protein conformation from another when the conformation depends on 3N variables, where N is the number of atoms. The variable commonly used is root-mean-square deviation, or RMSD.

Recall the formula for the standard deviation, s, summed over all data points i = 1, 2, .., N from a sampling distribution:

s = [ SUM(xi-<x>)2/(N - 1) ]0.5 Equation 4

Here, the brackets '< >' denote an ensemble average, meaning we're averaging over all possible states of the system.


The standard deviation is a measure of the spread of the distribution about the mean. In protein structure, we use RMSD to refer to the root mean square of the deviations of all atoms from their positions in some reference state. The larger the RMSD between two protein conformations, the more different they are.


The y-data in rmsd.dat are the RMSD of the current time state of the protein with respect to the reference structure it was in at the start of the simulation. The starting structure is actually an open conformation of the receptor. Relaxation during the simulation shows that the protein prefers to collapse into a closed conformation 70% of the time (the closed state binds drug more tightly). The collapsed

state has an average RMSD of 0.39 nm or 3.9 angstroms from the open reference state. Tomorrow we will discuss how to draw the dividing line between the two states.


How would we calculate the free energy change for the conformational change as a function of RMSD using Equation 1? We need to know the probability as a function of RMSD to do this, but right now rmsd.dat is time series data. To convert to probabilities, we need to construct a histogram. In other words, for a given value of RMSD, how many times was it observed.


We will return to the problem of histogram generation in the next tutorial on programing, since numerical calculations can't be done using the Unix shell alone.


Save a record of the Unix commands you entered today:

history > ~/day2/history.out

Tomorrow, we will be creating histograms in Tutorial 3.