grep, sed, awk

Grep

The command grep will print the lines matching a given pattern.

grep PATTERN file

grep -E PATTERN file (PATTERN is read as an extended regular expression; plain grep uses basic regular expressions)

Understanding grep with a simple fasta file:

>contig1

AATCTAGCATTTACGTAGTAGCTAAAGCTAAACCTCAGGGGCTACTTTAT

>contig2

ATTTACGTAGCATCAAATCTAGCATTTACGTAGTAGCTAAAGCTATTACG

Find a specific sequence within our sequences:

grep "AGGGG" file.fasta

-> will print only the first sequence: AATCTAGCATTTACGTAGTAGCTAAAGCTAAACCTCAGGGGCTACTTTAT

Find sequence headers only:

grep ">" file.fasta

-> will print all fasta headers:

>contig1

>contig2

Count number of sequences:

grep ">" file.fasta | wc -l

-> will count how many lines contain ">", which will match with the number of sequences: 2

We can also use the -c flag in grep to do the same:

grep -c ">" file.fasta

Print the DNA sequences with no headers:

grep -v ">" file.fasta

-> will print all lines that do not contain ">":

AATCTAGCATTTACGTAGTAGCTAAAGCTAAACCTCAGGGGCTACTTTAT

ATTTACGTAGCATCAAATCTAGCATTTACGTAGTAGCTAAAGCTATTACG
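The grep examples above can be reproduced end to end with the sketch below. The temporary directory and the variable names are assumptions made for illustration; the file contents are the two contigs used in this section.

```shell
# Create the example FASTA file in a temporary directory (hypothetical location).
workdir=$(mktemp -d)
fasta="$workdir/file.fasta"
printf '%s\n' \
  '>contig1' \
  'AATCTAGCATTTACGTAGTAGCTAAAGCTAAACCTCAGGGGCTACTTTAT' \
  '>contig2' \
  'ATTTACGTAGCATCAAATCTAGCATTTACGTAGTAGCTAAAGCTATTACG' > "$fasta"

# Lines containing the motif AGGGG (only the first sequence has it).
motif_hits=$(grep "AGGGG" "$fasta")

# Header lines only, and the number of sequences.
headers=$(grep ">" "$fasta")
n_seqs=$(grep -c ">" "$fasta")

# Sequence lines only (-v inverts the match).
seqs=$(grep -v ">" "$fasta")

echo "$motif_hits"
echo "$headers"
echo "sequences: $n_seqs"
echo "$seqs"
```

Always quote ">" on the command line: an unquoted > would be read by the shell as an output redirection and could overwrite the file.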

Sed

sed ("stream editor") is a tool that parses a file line by line and transforms the text, using a compact programming language that often fits in one line. sed is a powerful tool with a large set of commands, but the most common one is substitution, in which we find a pattern and replace it with another string.

sed 's/patternA/patternB/' file.txt

Understanding sed with a simple fasta file:

>contig1 assembled 2025 a

AATCTAGCATTTACGTAGTAGCTAAAGCTAAACCTCAGGGGCTACTTTAT

>contig2 assembled 2025 b

ATTTACGTAGCATCAAATCTAGCATTTACGTAGTAGCTAAAGCTATTACG

Modify the fasta header to contain "sequence" instead of "contig":

sed 's/contig/sequence/' file.fasta

We will obtain the entire file with the replacement:

>sequence1 assembled 2025 a

AATCTAGCATTTACGTAGTAGCTAAAGCTAAACCTCAGGGGCTACTTTAT

>sequence2 assembled 2025 b

ATTTACGTAGCATCAAATCTAGCATTTACGTAGTAGCTAAAGCTATTACG

Replace spaces with underscores, in order to avoid problems with other programs:

In this case we add the flag "g" at the end, to make sure every occurrence is replaced even if there are several within the same line:

sed 's/ /_/g' file.fasta

We will obtain the entire file with the replacement:

>contig1_assembled_2025_a

AATCTAGCATTTACGTAGTAGCTAAAGCTAAACCTCAGGGGCTACTTTAT

>contig2_assembled_2025_b

ATTTACGTAGCATCAAATCTAGCATTTACGTAGTAGCTAAAGCTATTACG

Simplify a fasta header:

sed 's/ .*//' file.fasta

We use a regular expression that matches the first space in a line, followed by any character (.) repeated any number of times (*), and we replace the whole match with nothing.

We will obtain the entire file with the replacement:

>contig1

AATCTAGCATTTACGTAGTAGCTAAAGCTAAACCTCAGGGGCTACTTTAT

>contig2

ATTTACGTAGCATCAAATCTAGCATTTACGTAGTAGCTAAAGCTATTACG
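The three substitutions above can be tried with the sketch below. It assumes headers of the form ">contig1 assembled 2025 a", as in the expected outputs of this section; the temporary directory and variable names are only illustrative.

```shell
# Build the example FASTA file (hypothetical temporary location).
workdir=$(mktemp -d)
fasta="$workdir/file.fasta"
printf '%s\n' \
  '>contig1 assembled 2025 a' \
  'AATCTAGCATTTACGTAGTAGCTAAAGCTAAACCTCAGGGGCTACTTTAT' \
  '>contig2 assembled 2025 b' \
  'ATTTACGTAGCATCAAATCTAGCATTTACGTAGTAGCTAAAGCTATTACG' > "$fasta"

# Replace "contig" with "sequence" (first occurrence on each line).
renamed=$(sed 's/contig/sequence/' "$fasta" | grep ">")

# Without g only the first space on each line is replaced; with g, all of them.
first_only=$(sed 's/ /_/' "$fasta" | grep ">")
underscored=$(sed 's/ /_/g' "$fasta" | grep ">")

# Drop everything from the first space to the end of the line.
simplified=$(sed 's/ .*//' "$fasta" | grep ">")

echo "$renamed"
echo "$first_only"
echo "$underscored"
echo "$simplified"
```

None of these commands changes file.fasta itself: sed writes the result to standard output, so redirect it to a new file to keep it.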

Extra sed options

One useful feature of sed is that we can use almost any character as the delimiter of the substitution command. Thus, instead of always using the format 's/a/b/', we can use another character, such as | @ ^ ! =. This is extremely useful when changing paths in a file, since paths already contain the character /. For example, to change an imaginary path from /home/evomics/bin to /usr/bin, we could type:

sed 's@/home/evomics/bin@/usr/bin@g' file.txt

This could also be solved by escaping the character, which means forcing the character to be read as plain text. For that, we use the backslash \:

sed 's/\/home\/evomics\/bin/\/usr\/bin/g' file.txt

As you can see, sed commands can become quite difficult to read. It helps to find the delimiters first and then read the command one part at a time. Here the pattern is \/home\/evomics\/bin, the replacement is \/usr\/bin, and the final g is the flag:

sed 's/\/home\/evomics\/bin/\/usr\/bin/g' file.txt
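A small sketch comparing the two spellings of the path substitution. The input line (a path to a hypothetical samtools binary) is an assumption made for illustration; both commands should give the same result.

```shell
# Hypothetical input line containing the path to change.
path_line='/home/evomics/bin/samtools'

# Using @ as the delimiter, so the slashes need no escaping.
with_at=$(printf '%s\n' "$path_line" | sed 's@/home/evomics/bin@/usr/bin@g')

# Using / as the delimiter, so every slash in the path is escaped.
with_escapes=$(printf '%s\n' "$path_line" | sed 's/\/home\/evomics\/bin/\/usr\/bin/g')

echo "$with_at"
echo "$with_escapes"
```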

The beauty of sed is that the code can be kept to one-liners. We can apply several substitutions to the same file with a single command by separating them with the character ;. For example:

sed 's/patternA/patternB/g;s/patternC/patternD/g' file.txt > output.txt
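As a minimal sketch of chained substitutions, the example below applies two of the header edits from this page in one command. The header text is an assumed example; the -e form is an equivalent spelling that some people find easier to read.

```shell
# Example header (assumed for illustration).
header='>contig1 assembled 2025 a'

# Two substitutions separated by ; are applied in order to each line.
combined=$(printf '%s\n' "$header" | sed 's/contig/sequence/;s/ /_/g')

# The same two substitutions, each given with its own -e option.
with_e=$(printf '%s\n' "$header" | sed -e 's/contig/sequence/' -e 's/ /_/g')

echo "$combined"
echo "$with_e"
```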

Finally, an important feature of sed is the possibility to reuse the text matched by the pattern. One way to do that is to use the symbol & in the replacement, which stands for the entire matched text.

sed 's/patternA/>&/g' file.txt > output.txt

The above command will transform every instance of patternA to >patternA (for example, creating fasta headers).
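A short sketch of & in action, assuming a list of bare contig names (made up for this example) that we want to turn into fasta headers:

```shell
# Hypothetical list of contig names, one per line.
ids=$(printf '%s\n' 'contig1' 'contig2')

# & is replaced by whatever the pattern matched, here contig followed by digits.
headers=$(printf '%s\n' "$ids" | sed 's/contig[0-9]*/>&/')

echo "$headers"
```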

We can also save groups of the pattern using parentheses, which must be escaped with backslashes in the basic regular expressions that sed uses by default:

sed 's/\(pattern\)A/\1 B/g' file.txt > output.txt

The command above saves the string pattern, enclosed in the escaped parentheses, so that it can be referred to in the replacement as \1. The command changes every patternA to pattern B, adding a space as well; the space in the replacement does not need to be escaped. We can save up to nine groups and refer to them as \1, \2, etc.
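A minimal sketch of capture groups, assuming sed's default basic regular expressions, where grouping parentheses are written \( and \). The last command uses the -E option (extended regular expressions, available in GNU and BSD sed), where plain parentheses group; the header text is an assumed example.

```shell
# One saved group: keep "pattern", replace the trailing A with " B".
simple=$(printf 'patternA\n' | sed 's/\(pattern\)A/\1 B/g')

# Two saved groups, swapped in the replacement.
header='>contig1 assembled 2025 a'
swapped=$(printf '%s\n' "$header" | sed 's/>\(contig\)\([0-9]*\)/>\2_\1/')

# Same substitution with extended regular expressions (no backslashes on the parentheses).
swapped_ere=$(printf '%s\n' "$header" | sed -E 's/>(contig)([0-9]*)/>\2_\1/')

echo "$simple"
echo "$swapped"
echo "$swapped_ere"
```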

Finally, remember the power of regular expressions!!

You can use regular expressions to find a series of patterns that have something in common, and change all of them at the same time.

AWK

AWK is a language designed for text processing, like sed and grep. AWK is a standard feature of most Unix-like operating systems. AWK reads one line at a time and, whenever the line meets a given condition, executes the desired action. A command combines a condition and an action (if the condition is omitted, the action runs on every line; if the action is omitted, the matching lines are printed):

awk 'condition {action}' file.txt

AWK is field-aware (column-aware):

$0 refers to the whole line

$1, $2, $3 ... refers to columns 1, 2, 3 ...

Understanding awk with a simple BED file:

contig1 20 1305 gene1 . +

contig1 4674 8563 gene4 . -

contig2 1239 5387 gene6 . -

contig3 546 3524 gene9 . +

Print only the lines containing genes in contig1 (note the double ==, which compares; a single = would assign a value to the column instead):

awk '$1=="contig1" {print}' file.bed

We would get the following printed out:

contig1 20 1305 gene1 . +

contig1 4674 8563 gene4 . -

Print only the lines containing genes in contig1 AND in forward orientation:

awk '$1=="contig1" && $6=="+" {print}' file.bed

We would get the following printed out:

contig1 20 1305 gene1 . +

Count how many genes we have in our file:

awk '$1 ~ /contig/' file.bed | wc -l

Or we can use purely AWK syntax. We can increment a counter each time the condition is met, in this case finding the word contig in column 1. We then use the special pattern END to run an extra action once all lines have been parsed:

awk '$1 ~ /contig/ {count++} END {print count}' file.bed

Both of these commands will print: 4

We can also use the special pattern BEGIN to run an action before we start parsing the lines in our file:

awk 'BEGIN {print "We have these many genes:"} $1 ~ /contig/ {count++} END {print count}' file.bed

This command will print:

We have these many genes:

4

Finally, we can combine information in multiple columns to create our conditions.

Print out the gene names of all genes that are larger than 2000 bp:

We can use the information in columns 2 and 3, which mark the start and end of each gene, and print the information in column 4 (gene name) if column 3 - column 2 is larger than 2000:

awk '($3 - $2 > 2000) {print $4}' file.bed

It will print:

gene4

gene6

gene9
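The BED examples above can be run with the following sketch. The temporary file and the variable names are assumptions made for illustration. The sketch compares with == and matches with ~, since a single = assigns a value in awk and would make the condition true for every line.

```shell
# Build the example BED file, tab-separated (hypothetical temporary location).
workdir=$(mktemp -d)
bed="$workdir/file.bed"
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  contig1 20 1305 gene1 . + \
  contig1 4674 8563 gene4 . - \
  contig2 1239 5387 gene6 . - \
  contig3 546 3524 gene9 . + > "$bed"

# Genes on contig1, and genes on contig1 in forward orientation.
contig1_genes=$(awk '$1 == "contig1" {print $4}' "$bed")
forward_genes=$(awk '$1 == "contig1" && $6 == "+" {print $4}' "$bed")

# Count the lines whose first column contains "contig".
n_genes=$(awk '$1 ~ /contig/ {count++} END {print count}' "$bed")

# Genes longer than 2000 bp (end minus start).
long_genes=$(awk '($3 - $2) > 2000 {print $4}' "$bed")

echo "$contig1_genes"
echo "$forward_genes"
echo "genes: $n_genes"
echo "$long_genes"
```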

Extra awk options

The condition can be any logical statement:

$3 > 0 value in column 3 greater than 0

$1 == 36 value in column 1 equals 36

$1 == $3 value in column 1 equals the value in column 3

$2 == "pattern" value in column 2 is exactly the string "pattern" (use $2 ~ /pattern/ to test whether it contains the pattern)

If condition is true, everything in {...} is executed.

We can specify when some actions get executed, using the special patterns BEGIN and END:

awk 'BEGIN {action} condition {action}' The first action will be executed only once at the start.

awk 'condition {action} END {action}' The last action will be executed only once at the end.

We can also put everything together:

awk 'BEGIN {action} condition {action} END {action}'

We are also able to concatenate actions:

awk 'condition {action1; action2; action3}'
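As a sketch of how these pieces fit together, the command below prints a header line in BEGIN, runs three actions on each matching line, and prints a total in END. The two input lines are taken from the BED example on this page; the variable names len and total are made up for illustration.

```shell
report=$(printf '%s\n' \
  'contig1 20 1305 gene1 . +' \
  'contig2 1239 5387 gene6 . -' |
  awk 'BEGIN {print "gene length"}
       $1 ~ /contig/ {len = $3 - $2; total += len; print $4, len}
       END {print "total", total}')

echo "$report"
```

The whole awk program is wrapped in single quotes so that the shell does not try to expand $1, $3 and the other field references itself.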

Actions

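AWK comes with built-in functions:

length(x) length of x (a string or a field)

print x print x (print is a statement rather than a function, so the parentheses are optional)

rand() generate a random number between 0 and 1

sqrt(x) calculate the square root of x

sub(x,y) replace the first match of x with y in the current line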

And we can define our own variables, such as:

n = n + 1 increment n each line

n = $2 * $3 multiply column 2 and 3 in each line and save it in n
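A small sketch using some of these functions and a user-defined variable on a single input line. The line is a shortened version of the BED example, and the names n and name_len are illustrative. Here sub is given a third argument so that it only edits column 1.

```shell
result=$(printf '%s\n' 'contig1 20 1305 gene1' |
  awk '{
    n = $3 - $2               # user-defined variable: feature length
    name_len = length($4)     # number of characters in column 4
    sub(/contig/, "chr", $1)  # replace the first match of contig in column 1
    print $1, n, name_len, sqrt(16)
  }')

echo "$result"
```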

Operators to use in conditionals

Assignment operators

= assignment

+= addition

-= subtraction

*= multiplication

/= division

Conditional operators

> greater than (< less than)

>= greater than or equal to (<= less than or equal to)

== equal to

!= not equal to

&& (AND) both the conditionals should be true

|| (OR) any of the conditionals can be true

Regular expression operators

~ match pattern

!~ do not match pattern
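A minimal sketch of the two regular expression operators, assuming a made-up two-column list of sequence names and genes:

```shell
# Hypothetical input: sequence name in column 1, gene name in column 2.
names=$(printf '%s\n' 'contig1 gene1' 'scaffold7 gene2' 'contig12 gene3')

# Column 1 is "contig" followed only by digits.
matching=$(printf '%s\n' "$names" | awk '$1 ~ /^contig[0-9]+$/ {print $2}')

# Column 1 does not start with "contig".
non_matching=$(printf '%s\n' "$names" | awk '$1 !~ /^contig/ {print $2}')

echo "$matching"
echo "$non_matching"
```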

By default, fields are separated by any run of white space (spaces or tabs), but the separator can be anything. The field separator can be specified with the -F option or by setting the FS variable, for example, if the fields are comma-separated:

awk -F',' '{print $3}' file.txt

awk -v FS=',' '{print $3}' file.txt
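Three equivalent ways of setting a comma as the field separator, sketched on a made-up comma-separated line: the -F option, the -v option to assign FS before the program starts, and a BEGIN block.

```shell
# Hypothetical comma-separated record.
csv='gene1,contig1,1285'

with_F=$(printf '%s\n' "$csv" | awk -F',' '{print $3}')
with_v=$(printf '%s\n' "$csv" | awk -v FS=',' '{print $3}')
with_begin=$(printf '%s\n' "$csv" | awk 'BEGIN {FS = ","} {print $3}')

echo "$with_F $with_v $with_begin"
```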

Some of the variables that can be recalled during the condition or action parts of the code are:

NR: Number of records (lines) read so far, counted across all input files.

FNR: Number of records read so far in the current input file; it restarts at 1 for each new file.

NF: Number of fields in the input line. We can refer to the last field in the input as $NF, the 2nd-to-last field as $(NF-1), etc.

FILENAME: Name of the current input file.

FS: Field separator that is used to divide the fields in the input line. As seen above, it can be re-assigned to any different delimiter than white space.

RS: Record separator. By default it is a newline, but it can be reassigned.

OFS: Output field separator assigns the field separator of the output. The default is the space character.

ORS: Output record separator assigns the record separator of the output. The default is a newline character.
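The sketch below illustrates several of these variables. The two small text files and their names (one.txt, two.txt) are assumptions made for the example; the BED-like lines come from the example earlier on this page.

```shell
# NR, NF, $NF and $(NF-2), printed with a tab as output field separator.
fields=$(printf '%s\n' \
  'contig1 20 1305 gene1 . +' \
  'contig2 1239 5387 gene6 . -' |
  awk 'BEGIN {OFS = "\t"} {print NR, NF, $NF, $(NF-2)}')

# NR keeps counting across files, FNR restarts for each file.
workdir=$(mktemp -d)
printf 'a\nb\n' > "$workdir/one.txt"
printf 'c\n' > "$workdir/two.txt"
counters=$(cd "$workdir" && awk '{print FILENAME, NR, FNR}' one.txt two.txt)

echo "$fields"
echo "$counters"
```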

How to put all this together?

Sum a series of numbers in the first column:

awk '{sum+=$1} END {print sum}'

Sum a series of numbers in the first column, only if they are larger than 500:

awk '$1 > 500 {sum+=$1} END {print sum}'

Add line numbers to the output:

awk '{print NR, $0}'

Print the length of each line:

awk '{print length($0)}'

Compute the average of the values in column 1:

awk '{sum+=$1} END {print sum/NR}'
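These one-liners read from standard input when no file is given. A quick sketch with three made-up values piped in; the check on NR in the average is an added safeguard against dividing by zero on empty input.

```shell
# Hypothetical column of numbers.
values=$(printf '%s\n' 100 600 800)

total=$(printf '%s\n' "$values" | awk '{sum += $1} END {print sum}')
total_big=$(printf '%s\n' "$values" | awk '$1 > 500 {sum += $1} END {print sum}')
numbered=$(printf '%s\n' "$values" | awk '{print NR, $0}')
lengths=$(printf '%s\n' "$values" | awk '{print length($0)}')
mean=$(printf '%s\n' "$values" | awk '{sum += $1} END {if (NR > 0) print sum / NR}')

echo "total=$total total_big=$total_big mean=$mean"
echo "$numbered"
echo "$lengths"
```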

Some extra examples on how to use awk on some bioinformatic files:

To print only the coordinate columns from the BED file:

awk '{print $1,$2,$3}' file.bed

In a BED file of repeat annotations, we can also print only the lines that contain a simple repeat (information contained in the 7th column):

awk '$7=="Simple_repeat"' file.bed

We can remove lines that contain an Unknown repeat:

awk '$7!="Unknown"' file.bed

We can combine multiple filters, and remove lines that are Unknown and lines that are Unspecified.

awk '$7!="Unknown" && $7!="Unspecified"' file.bed

Some repeats have multiple possible patterns in column seven. For instance, satellites can be specified as "Satellite" or "Satellite/macro". If we want to print all lines with this repeat, we could use this command:

awk '$7=="Satellite" || $7=="Satellite/macro"' file.bed

Alternatively, we could use a regular expression, and print all lines whose 7th column starts with "Sat":

awk '$7 ~ /^Sat/' file.bed
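To try the repeat filters, the sketch below builds a small seven-column file. All of its contents (coordinates, repeat names and classes) are invented for illustration and do not come from a real annotation.

```shell
# Hypothetical repeat annotation, tab-separated, repeat class in column 7.
workdir=$(mktemp -d)
bed="$workdir/file.bed"
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1 100 200 rep1 . + Simple_repeat \
  chr1 300 400 rep2 . - Unknown \
  chr2 500 900 rep3 . + Satellite \
  chr2 950 990 rep4 . + Satellite/macro \
  chr3 10 60 rep5 . - Unspecified > "$bed"

# Coordinates of the first record only.
first_coords=$(awk 'NR == 1 {print $1, $2, $3}' "$bed")

# Exact match, combined exclusion, and the two ways of selecting satellites.
simple=$(awk '$7 == "Simple_repeat" {print $4}' "$bed")
known=$(awk '$7 != "Unknown" && $7 != "Unspecified" {print $4}' "$bed")
sat_either=$(awk '$7 == "Satellite" || $7 == "Satellite/macro" {print $4}' "$bed")
sat_regex=$(awk '$7 ~ /^Sat/ {print $4}' "$bed")

echo "$first_coords"
echo "$simple"
echo "$known"
echo "$sat_either"
echo "$sat_regex"
```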
