awk (and shell) oneliners

saving time and pain when processing data

Why use awk?

The most universal data format is a text file with the data in columns separated by white space or tabs. Keeping data in this format you can be sure that you will always be able to access it. You are not hostage to a particular application or operating system. This is an argument for the use of plain text and still not for awk, so why awk? It is surprisingly often that one needs to extract a single column of data out of a text file, or you need to reverse the order of the columns, or sum them up, or normalize them or ... whatever. It is a relatively simple operation and one could do it in C, or Matlab or Mathematica or Excel ... if only one is willing to sit and write a program or you have the necessary application in hand. Awk is extremely good at slicing text files in records (generally a line) and in fields (the entries, generally separated by spaces). This means that manipulating data in text files is very straightforward and many operations can be implemented as oneliners. Additionally, awk is a traditional unix tool and will play ball with all the rest of the shell, so it lives comfortably inside a pipe, can be scripted to act on a bunch of files identically and understands regular expressions. 

Awk can be found on any unix/linux system, on the Mac OS X (via terminal) and can also be installed in Windows (gawk).

I find it particularly helpful for preprocessing data before plotting with gnuplot. 

Finding the maximum of a column of data

Suppose I have a file data.dat with three columns of numbers in plain text. I want to get the maximum value in column 3.

> awk 'BEGIN {max = 0} {if ($3>max) max=$3} END {print max}' data.dat

Finding the sum or average of a column of data

Again text file data.dat with columns of numbers. I extract the sum and average of column number 1.

>awk '{sum=sum+$1} END {print sum}' data.dat

>awk '{sum=sum+$1} END {print sum/NR} data.dat

Joining two files by columns

This is a common problem. Suppose you have two data files which contain related data, for example they both provide {coordinate, value} pairs, and you would like to have them in the same file so that they can be more easily operated on. Supposing that what you need is just to paste together the columns, there is a handy unix utility (it is also built into Mac OS X) called paste that will do the job.

data1.dat has data in the format : x y value1

data2.dat has data in the format: x y value2

>paste data1.dat data2.dat > data3.dat

will give you

data3.dat with format: x y value1 x y value 2

If it is inconvenient to have the x y values repeated, you can get rid of them with a simple awk script

> awk '{print $1" "$2" "$3" "$6}' data3.dat > data4.dat

and end up with

data4.dat: x y value1 value2

Converting a column of data from hex to decimal

awk '{print ("0x"$1) +0}' filename