Tutorial 2) Part 4: Awk Programming

Awk is a data-driven scripting language consisting of a set of actions taken on streams of textual data, either run directly on files or as part of a pipeline. A standard feature of Unix-like operating systems, awk is used for text processing and data extraction.


Download the sample Protein Data Bank file: Files: pdb.txt


cd ~/day2

PC: cp /cygdrive/c/Users/username/Downloads/pdb.txt ~/day2

Mac: cp ~/Downloads/pdb.txt ~/day2

ls -lrt confirm the file was downloaded


If the download didn't work copy-and-paste the following commands into a Unix terminal:

# Atom# Atom Residue Chain Resid x y z

echo "ATOM 1 N MET A 28 11.579 13.109 10.857" > pdb.txt

echo "ATOM 2 CA MET A 28 10.243 12.524 10.957" >> pdb.txt

echo "ATOM 3 C MET A 28 9.813 12.069 12.377" >> pdb.txt

echo "ATOM 4 O MET A 28 8.675 12.379 12.739" >> pdb.txt

echo "ATOM 5 CB MET A 28 10.010 11.433 9.898" >> pdb.txt

echo "ATOM 6 CG MET A 28 8.549 11.290 9.496" >> pdb.txt

echo "ATOM 7 SD MET A 28 8.299 10.450 7.910" >> pdb.txt

echo "ATOM 8 CE MET A 28 8.650 11.788 6.773" >> pdb.txt

echo "ATOM 9 N PRO A 29 10.637 11.357 13.207" >> pdb.txt

echo "ATOM 10 CA PRO A 29 10.152 10.982 14.554" >> pdb.txt

echo "ATOM 11 C PRO A 29 10.140 12.135 15.573" >> pdb.txt

echo "ATOM 12 O PRO A 29 9.493 12.012 16.618" >> pdb.txt

echo "ATOM 13 CB PRO A 29 11.096 9.851 14.971" >> pdb.txt

echo "ATOM 14 CG PRO A 29 12.363 10.136 14.256" >> pdb.txt

echo "ATOM 15 CD PRO A 29 12.018 10.869 12.983" >> pdb.txt


Note the a real protein will have >1,000 atoms, each with an xyz location in the structure file.. Let's demonstrate how Awk can extract columns of data for subsequent processing:

awk '{print $7,$8,$9}' pdb.txt

Awk is smart enough to group the variable data types into individual columns based on the existence of whitespace between the values on a given line. The whitespace can be a combination of one or more blank spaces and tabs (\t). In the old days of Fortran, your program would have to specify how many characters wide the string variable was, how long the integer variable was, and in which character locations it could be found!

Try adding some whitespace in pdb.txt and verify that awk interprets the columns correctly if they are not all lined up. If you delete the 9th entry on one line it should just omit it, keeping all the entries in register.


Like the pattern searching functions in grep and Vi, awk can return just the lines matching a search string (/ /) or regular expression..

awk '/CA/ {print "CAatom:",$2}' pdb.txt extracts atom number for alpha carbons

awk '! /C/ {print}' pdb.txt (! extracts lines without string "C")

awk '/C[AB]/ {print $3,$2}' pdb.txt (match any character enclosed in [ ] )


Grep of course has simpler syntax but is limited in function:

grep CA pdb.txt

Awk can use logical operators, performing an if-then conditional on each line ($0 represents entire line):

awk '{if ($7 < 10) print $0}' pdb.txt


Print every 3rd line from line 2 (lines 2, 5, 8..):

awk 'NR % 3 == 2' pdb.txt

Extract lines 3 to 6:

awk 'NR==3,NR==6 {print NR,$0}' pdb.txt

Print line (record) numbers/number of fields:

awk '{print NR,$0,NF} END {print "lines:",NR}' pdb.txt


Complex math can also be performed in awk. In scientific applications we'll be taking natural logarithms (base e). In programming languages natural log is 'log' in contrast to base 10 log which is log.10.

awk '{print -0.616*log($7)}' pdb.txt > pdb.out

When taking logs in programs it is important not to take the log of zero, which is mathematically undefined. Here, we pipe an empty string to awk just to test its natural log function:

echo "" | awk '{print log(2.718)}'

echo "" | awk '{print log(0)}'

awk 'BEGIN {print log(-1)}'

awk 'BEGIN {print sqrt(3.99)}'


Awk is generalizable scripting language, allowing for real programming, even without a data file..

awk 'BEGIN {initializations}

/search pattern 1/ {program actions}

/search pattern 2/ {program actions}

END {final actions}' <file>


awk 'BEGIN {for(i=1;i<=6;i++) print "square of",i,"is",i*i}'


Let's write a program, contained in the file pd.awk, to compute the molecular mass of our protein:

BEGIN {print "Calculating mass of peptide.."}

$1~/ATOM/ && $3~/^C/ {mass += 12.011; print "atom",$2,"carbon"}

$1~/ATOM/ && $3~/^N/ {mass += 14.007; print "atom",$2,"nitrogen"}

$1~/ATOM/ && $3~/^O/ {mass += 15.999; print "atom",$2,"oxygen"}

$1~/ATOM/ && $3~/^S/ {mass += 32.06; print "atom",$2,"sulfur"}

END {print " Molecular mass =",mass/1000,"kilodaltons (kDa)"}


pd.awk looks for entries that have both ATOM in field 1 AND (&&) have a field 3 starting with (^) a C, N, or O character. To execute our script file, use awk -f at the Unix prompt:

awk -f pd.awk pdb.txt


Q7: Use pd.awk to determine the mass of the dipeptide in pdb.txt in Daltons. Find the molecular weight of a real protein in kilodaltons: Files: 3ink.pdb. Next, modify the pd.awk program to print out the total number of carbon atoms contained in the molecule. Show your code and your results.


_________________________________________________________________________

Part 5: Application to Molecular Modeling

A useful formula in molecular/drug modeling applications is:

dG o = -RT ln Keq Equation 1

The Gibbs equation relates the free energy change of a chemical reaction to the equilibrium constant Keq, which is the ratio of [products]/[reactants]. R is the ideal gas constant and T is the thermodynamic temperature in Kelvin. At 37 Celsius, or 310 K, RT = 0.616 kcal/mol, which is a unit of energy. It turns out that 0.6 kcal is a measure of the thermal energy due to random fluctuations at room temperature. In other words, a good drug should bind its target with energy several times thermal energy (5-10 kcal). Because of the log relationship, for each kcal in stabilization (binding energy), a drug is ~5 times more likely to be bound to its target. In other words, the Keq in Equation 1 can be thought of as a probability.


In real applications we deal with a general process from state A to state B:

A <=> B, Keq =[B]eq /[A]eq Equation 2

Instead of arbitrarily saying the system is simply in either of one of the two states, we need to define a reaction or progress coordinate specifying how far the system has proceeded along the continuum from state A to state B. In chemistry, this is called the instantaneous reaction quotient: Q = [B] / [A] for a system not at equilibrium. In computational applications instead of Keq we compute the probability p of a given value of the generic progress coordinate r : p (r) for a continuous transition between bound and unbound drug, for example.


Solving equation 1 for p: dG = -RT ln p, we get

p(r) = exp[-dG(r)/RT] Equation 3

In programming 'exp' denotes e = 2.718.. raised to some power:

awk 'BEGIN {print exp(1)}'

awk 'BEGIN {print exp(2)}'

Alternatively we can approximate e2 using a few decimal places..

echo "" | awk '{print 2.718**2}' (** denotes exponent)


Note from Equation 3 that the higher the energy of a state of the system, the lower its probability. This is a thermodynamic principle: systems tend to minimize their free energy. This stems from the second law of thermodynamics, that systems tend to maximize their entropy (disorder).


Q8: For the reaction of some chemical species A into B, what percent of the system is in state B when dG = +1 kcal, 2 kcal, 3 kcal or 4 kcal (assume equilibrium)? If you use the following command, do your results agree with the factor of 5 relationship?

awk 'BEGIN {print exp(-1/0.616)}'


Let's look at some real data generated from a molecular dynamics simulation by the Hills Lab. Download: Files: rmsd.dat. We will continue working with this data in Tutorial 3..

cd ~/day2

PC: cp /cygdrive/c/Users/username/Downloads/rmsd.dat ~/day2

Mac: cp ~/Downloads/rmsd.dat ~/day2


ls -lrt rmsd.dat contains 53828 bytes (54 KB)

wc -l * the word count command lists the number of lines in each file

2001 rmsd.dat the simulation consists of 2000 time steps (50 ps each in column 1)

head rmsd.dat header command shows the first 10 lines in the file

tail rmsd.dat tail shows the last 10 lines; column 2 is RMSD in units of nanometers


The data file contains two columns of continuous XY data. A useful Unix program for plotting scientific data is gnuplot. Gnuplot doesn't work well within cygwin, so I produced the files for you.


The following rmsd.gpl script contains the commands for input into gnuplot:

set terminal png

set output "rmsd.png" #plot raw data:

plot '../rmsd.dat'


set encoding iso_8859_1

set terminal postscript enhanced "Helvetica" 12.0

set size 0.5

set output "rmsd.eps" #publication quality plot:

set xlabel "time (ns)" offset .6 font "Helvetica-Bold,14"

set ylabel "RMSD ({\305})" offset .4 font "Helvetica-Bold,14"

set ytics 1

set mxtics 2

set yrange [0:5]

plot '../rmsd.dat' using ($1/1000):($2*10) w lines linecolor rgb "black" linewidth 2 notitle


Gnuplot is its own command interpreter. You can enter the commands in real time at the gnuplot> prompt, followed by ^D to quit. Or the above .gpl script can be run from the Unix shell: % gnuplot rmsd.gpl