The wonderfully awesome world of awk
infile.txt
14 10 22 TRAF3
17 32 53 TP53
13 45 78 RB1

Manipulate text files with awk:
Move column(s) (this moves column 4 of the input to column 1 of the output, then prints columns 1, 2, 3 in order after it)
awk '{print $4, $1, $2, $3}' infile.txt > outfile.txt
TRAF3 14 10 22
TP53 17 32 53
RB1 13 45 78

Add new column (this adds a new column between columns 3 and 4 with the text "tumor_suppressor")
awk '{print $1, $2, $3, "tumor_suppressor", $4}' infile.txt > outfile.txt
14 10 22 tumor_suppressor TRAF3
17 32 53 tumor_suppressor TP53
13 45 78 tumor_suppressor RB1

Create new file with just the unique lines (this keeps only the very first entry in the list)
(ps - Thanks to Nizar Bahlis for finding this one)
example.txt (column1 = chromosome ; column2 = position ; column3 = Gene)
1 1234 TRAF3
1 1234 BRAF
1 1234 TRAF3
2 1234 KRAS

awk '!x[$0]++' example.txt
(x[$0]++ counts how many times each whole line has been seen, so the expression is true, and the line printed, only on its first occurrence)
1 1234 TRAF3
1 1234 BRAF
2 1234 KRAS

Create new file with just the unique entries based on a single column (this keeps only the very first entry in the list)
awk '!x[$1]++' example.txt
1 1234 TRAF3
2 1234 KRAS

Calculations with awk:
Mathematical Operators -
--------------------------------------------------------
Operator Meaning
+ addition
- subtraction
* multiplication
/ division
% modulo (remainder after division) (ie. 12 % 5 = 2)
--------------------------------------------------------
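These operators work anywhere awk expects an expression; for example, a quick check of modulo, division and multiplication from a BEGIN block (no input file needed):
awk 'BEGIN {print 12 % 5, 12 / 5, 12 * 5}'
2 2.4 60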
Calculate column sum
awk '{sum+=$1} END {print sum}' infile.txt
44

Calculate column average
awk '{sum+=$1} END {print sum/NR}' infile.txt
14.6667

Calculate row sum
awk '{sum=0; for(var=2;var<=NF;var++) sum = sum+$var; print sum}' MyTest.txt
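For example, given a hypothetical MyTest.txt where column 1 is an identifier (which is why the loop starts at column 2):
geneA 1 2 3
geneB 4 5 6
this prints the per-row sums 6 and 15, and the next command would append sum, count and average to each line (eg. geneA 1 2 3 6 3 2).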
Calculate row sum, count and average
awk 'BEGIN {FS=OFS="\t"}{sum=0; n=0; for(var=2;var<=NF;var++){sum+=$var; ++n}print $0, sum, n, sum/n}' MyTest.txt

Calculate using a constant value (print all columns, add 5 to each value in column 2)
awk '{print $1, $2+5, $3, $4}' infile.txt
14 15 22 TRAF3
17 37 53 TP53
13 50 78 RB1

Calculate using a script variable (variable set to 10, subtract the variable value from column 1 and print the result followed by column 4)
VAR=10
awk -v var1="$VAR" '{print $1-var1, $4}' infile.txt
4 TRAF3
7 TP53
3 RB1

Calculate using multiple script variables (variable 1 set to 5, variable 2 set to 4, subtract variable 1 from column 1 and add variable 2 to column 3, print all four columns)
VAR1=5
VAR2=4
awk -v var1="$VAR1" -v var2="$VAR2" 'BEGIN{print $1-var1, $2, $3+var2, $4}'
9 10 26 TRAF3
12 32 57 TP53
8 45 82 RB1

Calculate the Standard Deviation of a Column:
This calculates the Population Standard Deviation NOT the Sample Standard Deviation
awk '{sum+=$1; array[NR]=$1} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))**2);}print sqrt(sumsq/NR)}' input.txt
1.69967
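On the example file above (column 1 values 14, 17, 13) the mean is 44/3 = 14.6667, so the population standard deviation is sqrt(((14-14.6667)^2 + (17-14.6667)^2 + (13-14.6667)^2)/3) = 1.69967. If you want the Sample Standard Deviation instead, a minimal variant divides the sum of squares by NR-1 rather than NR (printing 2.08167 here):
awk '{sum+=$1; array[NR]=$1} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))**2);}print sqrt(sumsq/(NR-1))}' input.txt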
Searching with awk:
Though grep is a great way to search a file, it cannot limit the search to specific columns in the file; with awk, this is possible
awk '{if($1 == 14) print $0}' input.txt
14 10 22 TRAF3

# Nested if statement
awk '{if(($3==3195107) && ($2==3192730 || $2==3194272)) print $0}'

Search Operators -
--------------------------------------------------------
Operator Meaning
== is equal to
!= is not equal to
> is greater than
>= is greater than or equal to
< is less than
<= is less than or equal to
--------------------------------------------------------
Boolean Operators -
--------------------------------------------------------
Operator Meaning
&& AND
|| OR
! NOT
--------------------------------------------------------
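The search and Boolean operators combine naturally; for example, to print lines of infile.txt where column 1 is at least 14 AND column 4 is not TP53:
awk '$1 >= 14 && $4 != "TP53"' infile.txt
14 10 22 TRAF3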
Print specific lines from a file
awk 'NR==2,NR==3' input.txt (same as [head -n3 input.txt | tail -n2] but you don't need to do the math)
17 32 53 TP53
13 45 78 RB1

For bigger files the next version is faster, as the version above parses the entire file to the end, which can take a long time on a 200-million-line NGS fastq file
awk 'NR==5,NR==12 {print; if(NR==12) exit}' input.txt
This prints the lines between line numbers 5 and 12 (a range pattern) and then exits awk after line 12, as directed by the if statement
Find the line number with a specific feature
Use when you want to know what line number has something (ie. sometimes you need to manipulate a file based on line numbers)
awk '/FeatureToFind/{print FNR}' infile.txt
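For example, searching infile.txt for TP53 reports line 2:
awk '/TP53/{print FNR}' infile.txt
2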
Find lines with specific text lengths
To find lines where a column contains a specific character length (this example finds lines where column 11 contains exactly 100 characters)
awk '{ if (length($11) == 100 ) print }'

Input Format
Sometimes an input file is tab-separated but contains spaces within a column, which causes awk, by default, to parse on both spaces and tabs
To force awk to parse the file columns by tab exclusively (FS = Input Field Separator) (OFS = Output Field Separator)
awk 'BEGIN { FS = "\t" ; OFS = "\t"} {if($10 == "true") print $11, $14, $1, $2, $4, $5, $9, $3}'
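A quick way to see the difference, assuming a hypothetical line whose first column contains a space:
printf 'TP53 variant\t42\n' | awk '{print NF}'
3
printf 'TP53 variant\t42\n' | awk 'BEGIN{FS="\t"}{print NF}'
2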
Output Format
By default awk outputs files as space-delimited text files
To force the output to be tab-delimited
awk '{ OFS = "\t" ; print $1, $2-10, $3+10, $4, $5}' infile.txt > outfile.txt

Modify Specific lines in Single columns
This will replace any value greater than 2 in column 4 of the infile with 1.98
awk 'BEGIN{OFS="\t"}$4>2{$4=1.98}{print}' infile.txt > outfile.txt

Delete Specific fields/cells within a file if they match a certain value
This will replace the value in column 5 with a blank entry if it begins with chr. The match is apparently case-sensitive
~ = match
^ = begins with
awk '{if($5~/^chr/) {$5=""}} {print $0}' infile.txt > outfile.txt

Filter fastq reads on the quality of a molecular barcode
gunzip -c GM12878_CORIELL_p8_CL_Whole_T2_A2SHK_K12483_A4G70_AACGTGAT_L001_R2_001.fastq.gz | paste - - - - > random_index.txt
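#paste - - - - joins each 4-line fastq record (header, sequence, +, quality) into one tab-separated line, so column 2 holds the sequence and column 4 the quality string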
cut -f2 random_index.txt | cut -c2-7 > temp1
cut -f4 random_index.txt | cut -c2-7 > temp2
paste random_index.txt temp1 temp2 > mb.txt
awk -F "\t" '$6 ~ /[\x35-\x49]/ && $6 !~/[\x20-\x34]/ {print $0}' mb.txt | cut -f5 | sort | uniq | wc -l
#I can't get the regular expression to pull the ASCII strings that only have punctuation characters, but doing it the long-hand way seems to work
-bash-4.1$ awk -F "\t" '$5 !~/[N]/ && $6 ~/[56789:;<=>?@ABCDEFGHI]/ && $6 !~/[\x20-\x34]/ {print $0, "PASS"}' mb.txt | wc -l
3877152
-bash-4.1$ awk -F "\t" '$5 ~/[N]/ || $6 ~/[\x20-\x34]+/ {print $0, "FAIL"}' mb.txt | wc -l
684049
-bash-4.1$ wc -l mb.txt
4561201 mb.txt
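#Sanity check: 3877152 PASS + 684049 FAIL = 4561201, the total line count, so here every read falls into exactly one class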
Sort a flat file while keeping the header at the top of the file
It is often necessary to sort a file, but the native unix sort will sort the entire file, which can leave the header line anywhere in the output
#For a standard sort
awk 'NR == 1; NR > 1 {print $0 | "sort"}' file.txt
#Sort on column5
awk 'NR == 1; NR > 1 {print $0 | "sort -k5"}' file.txt
#Sort on column10 in numeric order
awk 'NR == 1; NR > 1 {print $0 | "sort -nk10"}' file.txt
#Sort on column10 in reverse numeric order
awk 'NR == 1; NR > 1 {print $0 | "sort -nrk10"}' file.txt
#Sort a comma separated file on column 1, then on columns 2 and 5 in numeric order
awk 'NR == 1; NR > 1 {print $0 | "sort -t, -k 1,1 -k 2,2n -k 5,5n"}' file.csv
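The header stays on top because awk prints line 1 directly to stdout, while every other line goes through the sort pipe, whose output only appears once awk closes the pipe at the end of the file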
#Extract the junctions matrix lines with the target junctions
#Get the header and only those lines with the target junctions
#Print every third column starting at column 5, as the file has 3 columns for each patient
tabix -h MMRF_CoMMpass_IA9pub_RNA_junctions.txt.gz 3:3195106-3195107 | awk 'NR == 1 ; NR > 1 {if(($3==3195107) && ($2==3192730 || $2==3194272)) print $0}' | awk -F "\t" '{for(i=5;i<=NF;i+=3) printf "%s%s", $i, (i+3>NF?"\n":FS)}' > CRBN_del10.txt