awk

AWK

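AWK is a language designed for text processing, often used alongside tools like sed and grep. AWK is a standard feature of most Unix-like operating systems. AWK reads its input one line at a time and, for every line that matches a given condition, executes the associated action. A rule is made of a condition and an action; if the condition is omitted the action runs on every line, and if the action is omitted the matching line is printed.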

awk 'condition {action}' file.txt

AWK is field-aware (column-aware):

$0 refers to the whole line

$1, $2, $3 ... refer to columns 1, 2, 3 ...
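
As a small illustration of field references, the sketch below runs awk on a single made-up line (the sample text and the shell variable names are assumptions for this example only):

```shell
# Hypothetical sample line with three whitespace-separated fields.
line='chr1 100 200'

# $0 is the whole line; $1 and $3 are the first and third fields.
whole=$(printf '%s\n' "$line" | awk '{print $0}')
picked=$(printf '%s\n' "$line" | awk '{print $3, $1}')

echo "$whole"    # chr1 100 200
echo "$picked"   # 200 chr1
```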


The condition can be any logical statement:

$3 > 0 value in column 3 greater than 0

$1 == 36 value in column 1 equals 36

$1 == $3 value in column 1 equals the value in column 3

$2 == "pattern" value in column 2 is exactly the string "pattern" (use $2 ~ /pattern/ to test whether it contains "pattern")

If the condition is true, everything in {...} is executed.
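
To see how these conditions behave, here is a minimal sketch on invented three-column data (the data and variable names are hypothetical). It contrasts string equality with == against a regular-expression match with ~, which is what tests for "contains":

```shell
# Hypothetical three-column input: id, label, value.
data='36 pattern 5
12 my_pattern_x 0
36 other -2'

# String equality: column 2 must be exactly "pattern".
exact=$(printf '%s\n' "$data" | awk '$2 == "pattern" {print $1}')

# Regular-expression match: column 2 only needs to contain "pattern".
contains=$(printf '%s\n' "$data" | awk '$2 ~ /pattern/ {print $1}')

# Numeric comparison on column 3.
positive=$(printf '%s\n' "$data" | awk '$3 > 0 {print $0}')

echo "$exact"
echo "$contains"
echo "$positive"
```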


We can specify when some actions get executed, using the special patterns BEGIN and END:

awk 'BEGIN {action} condition {action}' The first action is executed only once, before any input is read.

awk 'condition {action} END {action}' The last action is executed only once, after all input has been read.

We can also put everything together:

awk 'BEGIN {action} condition {action} END {action}'
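
A minimal sketch of the three parts working together, on made-up numeric input (the printed labels are arbitrary choices for this example):

```shell
# Hypothetical input: one number per line.
out=$(printf '%s\n' 3 -1 7 | awk '
    BEGIN  { print "start" }          # runs once, before any input is read
    $1 > 0 { print "positive:", $1 }  # runs for every line matching the condition
    END    { print "done" }           # runs once, after the last line
')
echo "$out"
```

With this input, the BEGIN line is printed first, then one line for each positive number, then the END line.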


We can also run several statements within one action, separated by semicolons:

awk 'condition {action1; action2; action3}'
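
For instance, the following sketch runs two statements for each matching line of some made-up input (the variable name doubled is just an illustration):

```shell
# Hypothetical input: one number per line.
out=$(printf '%s\n' 4 -2 9 | awk '$1 > 0 {doubled = $1 * 2; print $1, doubled}')
echo "$out"
```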


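AWK comes with built-in functions and statements, for example:

length(x) length of the string x (of the whole line if x is omitted)

print x print x (the whole line if x is omitted); print is a statement rather than a function

rand() generate a random number between 0 and 1

sqrt(x) calculate the square root of x

sub(x,y) replace the first match of the regular expression x in the current line with y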
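
A short sketch of these built-ins on a made-up line (the sample text is an assumption). Note that sub() edits the line in place, so we print afterwards, and that srand() is called here so that rand() is seeded from the clock:

```shell
# Hypothetical input line.
line='hello world 16'

# length() of the first field and sqrt() of the third field.
stats=$(printf '%s\n' "$line" | awk '{print length($1), sqrt($3)}')

# sub() replaces the first match of /world/ in the current line, then we print the line.
edited=$(printf '%s\n' "$line" | awk '{sub(/world/, "awk"); print}')

# rand() returns a value r with 0 <= r < 1; the exact value differs between runs.
in_range=$(awk 'BEGIN {srand(); r = rand(); ok = (r >= 0 && r < 1) ? "yes" : "no"; print ok}')

echo "$stats"     # 5 4
echo "$edited"    # hello awk 16
echo "$in_range"  # yes
```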

And we can define our own variables, such as:

n = n + 1 increment n

n += $2 * $3 add the product of columns 2 and 3 to n
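
As a sketch of user-defined variables, the example below counts lines and accumulates a running total over invented data (the column meanings and the names n and total are assumptions). Variables do not need to be declared; an unset variable behaves as 0 in arithmetic:

```shell
# Hypothetical input: item, price, quantity.
data='apple 2 3
pear 4 5'

# n counts the lines; total accumulates price * quantity for each line.
result=$(printf '%s\n' "$data" | awk '{n = n + 1; total += $2 * $3} END {print n, total}')
echo "$result"   # 2 26
```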


By default, fields are separated by any run of white space (spaces or tabs), but the separator can be anything. The field separator is set with the -F option or by assigning the FS variable; for example, if the fields are comma-separated:

awk -F',' '{print $3}' file.txt

or

awk -v FS=',' '{print $3}' file.txt
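
The sketch below shows three common ways of setting a comma as the field separator, on a made-up line; they should all select the same field. The -v FS=',' and BEGIN forms both assign the FS variable directly:

```shell
# Hypothetical comma-separated line.
line='a,b,c,d'

with_F=$(printf '%s\n' "$line" | awk -F',' '{print $3}')                 # -F option
with_v=$(printf '%s\n' "$line" | awk -v FS=',' '{print $3}')             # assign FS with -v
with_begin=$(printf '%s\n' "$line" | awk 'BEGIN {FS = ","} {print $3}')  # assign FS in BEGIN

echo "$with_F" "$with_v" "$with_begin"   # c c c
```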


Some of the built-in variables that can be used in the condition or action parts of the code are:

NR: Number of records (lines) read so far, counted across all input files.

FNR: Number of records read so far from the current input file; it restarts at 1 for each new file.

NF: Number of fields in the input line. We can refer to the last field in the input as $NF, the 2nd-to-last field as $(NF-1), etc.

FILENAME: Name of the current input file.

FS: Field separator that is used to divide the input line into fields. As seen above, it can be reassigned to a delimiter other than white space.

RS: Record separator. By default it is a newline, but it can be reassigned.

OFS: Output field separator, placed between fields in the output. The default is the space character.

ORS: Output record separator, placed at the end of each output record. The default is a newline character.
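
To make the difference between these variables concrete, here is a sketch that reads two small files created just for the example (the file names and contents are hypothetical). OFS is set to a tab, so the commas in print produce tab-separated output:

```shell
# Two small hypothetical files in a temporary directory.
tmpdir=$(mktemp -d)
printf 'a b c\nd e\n' > "$tmpdir/one.txt"
printf 'f g h i\n'    > "$tmpdir/two.txt"

# NR keeps counting across files, FNR restarts at 1 for each file,
# NF is the number of fields on the current line, and $NF is its last field.
report=$(cd "$tmpdir" && awk 'BEGIN {OFS = "\t"} {print FILENAME, NR, FNR, NF, $NF}' one.txt two.txt)
echo "$report"

rm -rf "$tmpdir"
```

For the single line of two.txt, NR is 3 while FNR is back to 1.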


How to put all this together?

Sum a series of numbers in the first column:

awk '{sum+=$1} END {print sum}'

Sum a series of numbers in the first column, only if they are larger than 500:

awk '$1 > 500 {sum+=$1} END {print sum}'

Add line numbers to the output:

awk '{print NR, $0}'

Print the length of each line:

awk '{print length($0)}'

Compute the average of the values in column 1:

awk '{sum+=$1} END {print sum/NR}'
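
The commands above read from standard input when no file is given. The sketch below feeds them some made-up numbers through a pipe; the guard on NR in the average is an addition for this example, since NR is 0 for empty input and sum/NR would then be a division by zero:

```shell
# Hypothetical input: one number per line.
nums=$(printf '%s\n' 100 600 800)

total=$(printf '%s\n' "$nums" | awk '{sum += $1} END {print sum}')
big=$(printf '%s\n' "$nums" | awk '$1 > 500 {sum += $1} END {print sum}')
avg=$(printf '%s\n' "$nums" | awk '{sum += $1} END {if (NR > 0) print sum / NR}')

echo "$total"   # 1500
echo "$big"     # 1400
echo "$avg"     # 500
```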


Some extra examples of how to use awk on bioinformatics files:

To print only the coordinate columns from a BED file:

awk '{print $1,$2,$3}' file.bed

We can also print only the lines that contain a simple repeat (information contained in the 7th column):

awk '$7=="Simple_repeat"' file.bed

We can remove lines that contain an Unknown repeat:

awk '$7!="Unknown"' file.bed

We can combine multiple filters, and remove lines that are either Unknown or Unspecified:

awk '$7!="Unknown" && $7!="Unspecified"' file.bed

Some repeats have multiple possible patterns in column seven. For instance, satellites can be specified as "Satellite" or "Satellite/macro". If we want to print all lines with this repeat, we could use this command:

awk '$7=="Satellite" || $7=="Satellite/macro"' file.bed

Alternatively, we could use a regular expression, and print all lines whose 7th column starts with "Sat":

awk '$7 ~ /^Sat/' file.bed
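
To try these filters without a real file, the sketch below builds a few invented, tab-separated BED-like records with the repeat class in column 7 (the coordinates and names are made up). Setting both FS and OFS to a tab is an assumption that the file is tab-delimited, and it keeps the printed coordinates tab-separated too:

```shell
# Hypothetical tab-separated records; column 7 holds the repeat class.
bed=$(printf 'chr1\t10\t50\tr1\t0\t+\tSimple_repeat\nchr1\t60\t90\tr2\t0\t-\tSatellite/macro\nchr2\t5\t25\tr3\t0\t+\tUnknown\n')

# Keep rows whose column 7 starts with "Sat" and print their coordinates.
sat=$(printf '%s\n' "$bed" | awk 'BEGIN {FS = OFS = "\t"} $7 ~ /^Sat/ {print $1, $2, $3}')

# Count the rows whose repeat class is not "Unknown".
known=$(printf '%s\n' "$bed" | awk -F'\t' '$7 != "Unknown" {n++} END {print n}')

echo "$sat"
echo "$known"   # 2
```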