Inside the folder /home/genomics/workshop_materials/unix_tutorial, you will find a folder called data. It contains a dataset that includes:
genome_assembly.fasta
sequencing_reads.fastq
variant_analysis.vcf
annotation.gff
For each of the files in the folder data, produce a symlink inside the folder working_directory, within the unix folder.
Now we can easily work with the files without having to copy them or risk modifying them with the command mv.
To keep the same names of the files and link them all at the same time, you can run:
ln -s /home/genomics/workshop_materials/unix_tutorial/data/* . (The command assumes you are in working_directory)
How many sequences does the FASTA file contain?
Extract sequence headers and save them to a separate file
Simplify the fasta header to only contain >NODE_X (NODE, followed by its unique contig number).
Reduce fasta header to a maximum of 5 characters (making sure headers are unique)
How many sequences does the FASTA file contain?
grep -c "^>" fasta_file.fasta
sed -n '/^>/p' fasta_file.fasta | wc -l
awk '/^>/{count++} END{print count}' fasta_file.fasta
Extract sequence headers and save them to a separate file
grep "^>" fasta_file.fasta > headers.txt
sed -n '/^>/p' fasta_file.fasta > headers.txt
awk '/^>/{print}' fasta_file.fasta > headers.txt
Simplify the fasta header to only contain >NODE_X (NODE, followed by its unique contig number).
sed '/^>/ s/\(>NODE_[0-9]*\).*/\1/' fasta_file.fasta > simplified_fasta.fasta
We need to escape the parentheses because sed only uses basic regular expressions, which properly identifyies symbols like ^, *, or .
But we can add the flag -E to mark that we want to use the parentheses as regular expressions:
sed -E 's/(>NODE_[0-9]*).*/\1/' fasta_file.fasta > simplified_fasta.fasta
awk -F"_" '/^>/ {print $1 "_" $2} !/^>/ {print}' fasta_file.fasta > simplified_fasta.fasta
Reduce fasta header to a maximum of 5 characters (making sure headers are unique)
We can make sure we grab the number after NODE_, which can have up to 4 digits, and add our own character before to make it a maximum of 5 characters (for example, the character c for contig)
sed '/^>/ s/^>NODE_\([0-9]\{1,4\}\).*/>c\1/' fasta_file.fasta > simplified_fasta.fasta
sed '/^>/ s/^>NODE_\([0-9]\+\).*/>c\1/' fasta_file.fasta > simplified_fasta.fasta
awk -F "_" '/^>/ {print ">c" $2} !/^>/ {print}' fasta_file.fasta > simplified_fasta.fasta
How many reads does the fastq files contain?
Calculate the average length of the reads.
How many reads does the fastq files contain?
grep -c '^@SRR' fastq_file.fastq
sed -n '/^@SRR/p' fastq_file.fastq | wc -l
awk '/^@SRR/ {count++} END {print count}' fastq_file.fastq
Calculate the average length of the reads
awk 'NR % 4 == 2 { total_length += length($0); count++ } END { if (count > 0) print total_length / count }' fastq_file.fastq
NR % 4 == 2: This condition selects every second line (the sequence lines in a FastQ file).
total_length += length($0): Adds the length of each sequence to the total_length variable.
count++: Counts the number of sequences processed.
END { if (count > 0) print total_length / count }: After processing all lines, it calculates and prints the average length of the reads.
We can also use the length specified in the header:
grep '^@SRR' fastq_file.fastq | sed 's/.*length=\([0-9]*\).*/\1/' | awk '{ total_length += $1; count++ } END { if (count > 0) print total_length / count }'
grep '^@SRR' fastq_file.fastq: Extracts all header lines that start with @SRR
sed 's/.*length=\([0-9]*\).*/\1/': Uses sed to extract the numeric value following length= from the header line.
awk '{ total_length += $1; count++ } END { if (count > 0) print total_length / count }': Adds the extracted length to total_length and counts the number of sequences, then calculates the average.
Save to a new file all the lines that are variants, excluding the VCF metadata
How many variants have been called?
How many of those are SNPs?
How many are INDELs?
How many transitions appear across all SNPs?
How many transversions?
Save to a new file all the lines that are variants, excluding the VCF metadata
grep -v '^#' input.vcf > variants_only.vcf
sed '/^#/d' input.vcf > variants_only.vcf
matches lines starting with #
the flag "d" means delete
awk '$1 !~ /^#/' input.vcf > variants_only.vcf
$1: Refers to the first field (column) of each line
!/^#/: Matches lines where the first field does not start with #
Redirect the output to variants_only.vcf
How many variants have been called?
grep -v '^#' input.vcf | wc -l
grep -v -c '^#' input.vcf
awk '!/^#/' input.vcf | wc -l
awk '!/^#/{count++} END {print count}' input.vcf
How many of those are SNPs?
grep -v '^#' input.vcf | grep -c 'TYPE=snp'
grep -v '^#' input.vcf | grep 'TYPE=snp' | wc -l
awk '!/^#/ && /TYPE=snp/ {count++} END {print count}' input.vcf
How many are INDELs?
grep -v '^#' input.vcf | grep -E 'TYPE=(ins|del)' | wc -l
awk '!/^#/ && /TYPE=(ins|del)/ {count++} END {print count}' input.vcf
How many transitions appear across all SNPs?
To count transitions (Ts) and transversions (Tv) among SNPs in a VCF file, we compare the REF and ALT columns. Transitions occur between purines (A ↔ G) or pyrimidines (C ↔ T), while transversions involve purine ↔ pyrimidine changes (A ↔ C, A ↔ T, G ↔ C, G ↔ T).
awk '!/^#/ && /TYPE=snp/ { if (($4 == "A" && $5 == "G") || ($4 == "G" && $5 == "A") || ($4 == "C" && $5 == "T") || ($4 == "T" && $5 == "C")) { transitions++ } } END {print "Transitions:", transitions}' input.vcf
How many transversions?
awk '!/^#/ && /TYPE=snp/ { if (($4 == "A" && ($5 == "C" || $5 == "T")) || ($4 == "G" && ($5 == "C" || $5 == "T")) || ($4 == "C" && ($5 == "A" || $5 == "G")) || ($4 == "T" && ($5 == "A" || $5 == "G"))) { transversions++ } } END {print "Transversions:", transversions}' input.vcf
How many genes have been predicted/annotated? Be aware that some lines are headers, and we also have the entire genome sequence at the end! Find some common pattern shared only by the annotated gene lines.
How many genes we failed to identify with an actual protein name?
How many genes are larger than 2,000 basepairs?
How many genes have been predicted/annotated? Be aware that some lines are headers, and we also have the entire genome sequence at the end! Find some common pattern shared only by the annotated gene lines.
grep -c 'ID=' input.gff
awk '/ID=/' input.gff | wc -l
awk '/ID=/ {count++} END {print count}' input.vcf
You can also decide to only check the coding sequences, in that case, the commands would be:
grep -c 'CDS.*ID=' input.gff
awk '$3 == "CDS" && /ID=/' input.gff | wc -l
How many genes we failed to identify with an actual protein name?
grep -c 'product=hypothetical protein' input.gff
awk '/product=hypothetical protein/' input.gff | wc -l
awk '/product=hypothetical protein/ {count++} END {print count}' input.gff
How many genes are larger than 2,000 basepairs?
You can identify genes larger than 2,000 base pairs by calculating the difference between the start and end positions in the GFF file.
awk '$3 == "CDS" && ($5 - $4 + 1) > 2000' input.gff | wc -l
awk '$3 == "CDS" && ($5 - $4 + 1) > 2000 {count++} END {print count}' input.gff