Exercises

Time to put into practice everything you learned!

Inside the folder /home/genomics/workshop_materials/unix_tutorial, you will find a folder called data. It contains a dataset that includes:

genome_assembly.fasta
sequencing_reads.fastq
variant_analysis.vcf
annotation.gff

Let's start by setting our working directory!

For each of the files in the folder data, produce a symlink inside the folder working_directory, within the unix folder.

Now we can easily work with the files without having to copy them or risk modifying them with the command mv.

To keep the same names of the files and link them all at the same time, you can run:

ln -s /home/genomics/workshop_materials/unix_tutorial/data/* . (The command assumes you are in working_directory)

FASTA

Beginner

How many sequences does the FASTA file contain?

Extract sequence headers and save them to a separate file

Intermediate

Simplify the fasta header to only contain >NODE_X (NODE, followed by its unique contig number).

Expert

Reduce fasta header to a maximum of 5 characters (making sure headers are unique)

Solutions FASTA

How many sequences does the FASTA file contain?

grep -c "^>" fasta_file.fasta

sed -n '/^>/p' fasta_file.fasta | wc -l

awk '/^>/{count++} END{print count}' fasta_file.fasta

Extract sequence headers and save them to a separate file

grep "^>" fasta_file.fasta > headers.txt
sed -n '/^>/p' fasta_file.fasta > headers.txt

awk '/^>/{print}' fasta_file.fasta > headers.txt

Simplify the fasta header to only contain >NODE_X (NODE, followed by its unique contig number).

sed '/^>/ s/$>NODE_[0-9]*$.*/\1/' fasta_file.fasta > simplified_fasta.fasta
We need to escape the parentheses because sed only uses basic regular expressions, which properly identifyies symbols like ^, *, or .
But we can add the flag -E to mark that we want to use the parentheses as regular expressions:
sed -E 's/(>NODE_[0-9]*).*/\1/' fasta_file.fasta > simplified_fasta.fasta

awk -F"_" '/^>/ {print $1 "_" $2} !/^>/ {print}' fasta_file.fasta > simplified_fasta.fasta

Reduce fasta header to a maximum of 5 characters (making sure headers are unique)

We can make sure we grab the number after NODE_, which can have up to 4 digits, and add our own character before to make it a maximum of 5 characters (for example, the character c for contig)

sed '/^>/ s/^>NODE_$[0-9]\{1,4\}$.*/>c\1/' fasta_file.fasta > simplified_fasta.fasta
sed '/^>/ s/^>NODE_$[0-9]\+$.*/>c\1/' fasta_file.fasta > simplified_fasta.fasta
awk -F "_" '/^>/ {print ">c" $2} !/^>/ {print}' fasta_file.fasta > simplified_fasta.fasta

FASTQ

Beginner

How many reads does the fastq files contain?

Expert

Calculate the average length of the reads.

Solutions FASTQ

How many reads does the fastq files contain?

grep -c '^@SRR' fastq_file.fastq
sed -n '/^@SRR/p' fastq_file.fastq | wc -l

awk '/^@SRR/ {count++} END {print count}' fastq_file.fastq

Calculate the average length of the reads

awk 'NR % 4 == 2 { total_length += length($0); count++ } END { if (count > 0) print total_length / count }' fastq_file.fastq

NR % 4 == 2: This condition selects every second line (the sequence lines in a FastQ file).
total_length += length($0): Adds the length of each sequence to the total_length variable.
count++: Counts the number of sequences processed.
END { if (count > 0) print total_length / count }: After processing all lines, it calculates and prints the average length of the reads.

We can also use the length specified in the header:
grep '^@SRR' fastq_file.fastq | sed 's/.*length=$[0-9]*$.*/\1/' | awk '{ total_length += $1; count++ } END { if (count > 0) print total_length / count }'

grep '^@SRR' fastq_file.fastq: Extracts all header lines that start with @SRR
sed 's/.*length=$[0-9]*$.*/\1/': Uses sed to extract the numeric value following length= from the header line.
awk '{ total_length += $1; count++ } END { if (count > 0) print total_length / count }': Adds the extracted length to total_length and counts the number of sequences, then calculates the average.

VCF

Beginner

Save to a new file all the lines that are variants, excluding the VCF metadata

How many variants have been called?

Intermediate

How many of those are SNPs?

How many are INDELs?

Expert

How many transitions appear across all SNPs?

How many transversions?

Solutions VCF

Save to a new file all the lines that are variants, excluding the VCF metadata

grep -v '^#' input.vcf > variants_only.vcf

sed '/^#/d' input.vcf > variants_only.vcf

matches lines starting with #

the flag "d" means delete
awk '$1 !~ /^#/' input.vcf > variants_only.vcf

$1: Refers to the first field (column) of each line

!/^#/: Matches lines where the first field does not start with #

Redirect the output to variants_only.vcf

How many variants have been called?

grep -v '^#' input.vcf | wc -l

grep -v -c '^#' input.vcf

awk '!/^#/' input.vcf | wc -l

awk '!/^#/{count++} END {print count}' input.vcf

How many of those are SNPs?

grep -v '^#' input.vcf | grep -c 'TYPE=snp'

grep -v '^#' input.vcf | grep 'TYPE=snp' | wc -l

awk '!/^#/ && /TYPE=snp/ {count++} END {print count}' input.vcf

How many are INDELs?

grep -v '^#' input.vcf | grep -E 'TYPE=(ins|del)' | wc -l

awk '!/^#/ && /TYPE=(ins|del)/ {count++} END {print count}' input.vcf

How many transitions appear across all SNPs?

To count transitions (Ts) and transversions (Tv) among SNPs in a VCF file, we compare the REF and ALT columns. Transitions occur between purines (A ↔ G) or pyrimidines (C ↔ T), while transversions involve purine ↔ pyrimidine changes (A ↔ C, A ↔ T, G ↔ C, G ↔ T).

awk '!/^#/ && /TYPE=snp/ { if (($4 == "A" && $5 == "G") || ($4 == "G" && $5 == "A") || ($4 == "C" && $5 == "T") || ($4 == "T" && $5 == "C")) { transitions++ } } END {print "Transitions:", transitions}' input.vcf

How many transversions?

awk '!/^#/ && /TYPE=snp/ { if (($4 == "A" && ($5 == "C" || $5 == "T")) || ($4 == "G" && ($5 == "C" || $5 == "T")) || ($4 == "C" && ($5 == "A" || $5 == "G")) || ($4 == "T" && ($5 == "A" || $5 == "G"))) { transversions++ } } END {print "Transversions:", transversions}' input.vcf

GFF

Beginner

How many genes have been predicted/annotated? Be aware that some lines are headers, and we also have the entire genome sequence at the end! Find some common pattern shared only by the annotated gene lines.

Intermediate

How many genes we failed to identify with an actual protein name?

Expert

How many genes are larger than 2,000 basepairs?

Solutions GFF

grep -c 'ID=' input.gff

awk '/ID=/' input.gff | wc -l

awk '/ID=/ {count++} END {print count}' input.vcf

You can also decide to only check the coding sequences, in that case, the commands would be:

grep -c 'CDS.*ID=' input.gff
awk '$3 == "CDS" && /ID=/' input.gff | wc -l

How many genes we failed to identify with an actual protein name?

grep -c 'product=hypothetical protein' input.gff

awk '/product=hypothetical protein/' input.gff | wc -l

awk '/product=hypothetical protein/ {count++} END {print count}' input.gff

How many genes are larger than 2,000 basepairs?

You can identify genes larger than 2,000 base pairs by calculating the difference between the start and end positions in the GFF file.
awk '$3 == "CDS" && ($5 - $4 + 1) > 2000' input.gff | wc -l
awk '$3 == "CDS" && ($5 - $4 + 1) > 2000 {count++} END {print count}' input.gff

Page updated

Google Sites

Report abuse