Exercises

Time to put into practice everything you learned!

Inside the folder /home/genomics/workshop_materials/unix, you will find a folder called data. It contains a dataset that includes:

Sequencing read files (fastq), generated by paired-end Illumina sequencing.
Genome assembly file (fasta) containing a few chromosomes.
Genome annotation in gtf and gff formats
Repeat elements and locations (bed file)
Variant call file (vcf), showing the variation within the population

Let's start by setting our working directory!

For each of the files in the folder data, produce a symlink inside the folder working_directory, within the unix folder.
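One possible way to do this is a small loop around ln -s. The sketch below builds a throwaway copy of the folder layout with mktemp so it can run anywhere; the two file names are made up. In the workshop you would point unix_dir at /home/genomics/workshop_materials/unix instead and skip the setup lines.

```shell
# Toy setup (assumption): a temporary stand-in for the unix folder.
unix_dir=$(mktemp -d)
mkdir -p "$unix_dir/data" "$unix_dir/working_directory"
touch "$unix_dir/data/reads_1.fastq" "$unix_dir/data/genome.fasta"   # hypothetical names

# Create one symlink per file in data/.
# The glob expands to absolute paths, so the links resolve from any directory.
for f in "$unix_dir"/data/*; do
    ln -s "$f" "$unix_dir/working_directory/"
done

# Each entry should be shown as "name -> /path/to/data/name".
ls -l "$unix_dir/working_directory"
```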

Now we can work with the files without copying them, and without the risk of moving or renaming the originals by accident with mv.

Submit your answers to help build a collaborative answer sheet

Download the data

FASTQ

How many reads do the fastq files contain?

Calculate the average length of the reads.
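A minimal sketch of both FASTQ questions, assuming every record takes exactly four lines (header, sequence, "+", qualities), which is the usual layout for Illumina output but not guaranteed by the format. The two toy reads are made up; replace "$fq" with one of the workshop files.

```shell
# Toy FASTQ file (assumption): two reads of length 4 and 6.
fq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' > "$fq"

# Number of reads = number of lines divided by 4.
n_reads=$(awk 'END { print NR / 4 }' "$fq")

# Mean read length: the sequence is the 2nd line of every 4-line record.
avg_len=$(awk 'NR % 4 == 2 { total += length($0); n++ }
               END { if (n) print total / n }' "$fq")

echo "reads: $n_reads, mean length: $avg_len"
```

Counting lines that start with "@" is tempting, but quality strings may also start with "@", so the line-count approach is usually safer.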

FASTA

Simplify fasta headers

Modify the fasta header so that it does not contain spaces

Modify the fasta header to a shorter identifier (no more than 5 characters), while keeping it informative.
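Two sed sketches for the header tasks, run on a made-up FASTA file (the accession-like names and descriptions are hypothetical). Both restrict the substitution to lines starting with ">" so that sequence lines are never touched. What counts as an informative 5-character identifier depends on the real headers, so that part is left to you.

```shell
# Toy FASTA file (assumption): headers with a description after the identifier.
fa=$(mktemp)
printf '>NC_000001.1 Toy species chromosome 1\nACGTACGT\nACGT\n' > "$fa"
printf '>NC_000002.1 Toy species chromosome 2\nACGT\n' >> "$fa"

# Option 1: replace every space in a header with an underscore.
no_spaces=$(sed '/^>/ s/ /_/g' "$fa" | grep '^>')

# Option 2: keep only the first word of each header.
first_word=$(sed '/^>/ s/[[:space:]].*//' "$fa" | grep '^>')

echo "$no_spaces"
echo "$first_word"
```

The grep at the end only prints the headers for inspection; drop it and redirect to a new file to get the full modified FASTA.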

How long is each of the contigs?

Can you calculate the N50 of the genome assembly?
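One way to approach contig lengths and N50 with awk and sort, shown on three made-up contigs of 8, 6 and 4 bases. N50 is taken here as the length of the contig at which the running total, going from longest to shortest, first reaches half of the assembly size.

```shell
# Toy FASTA file (assumption): c1 is split over two lines to mimic wrapped sequence.
fa=$(mktemp)
printf '>c1\nACGTAC\nGT\n>c2\nACGTAC\n>c3\nACGT\n' > "$fa"

# Sum the sequence lines between headers; print "name<TAB>length".
lengths=$(awk '/^>/ { if (seen) print name "\t" len
                      name = substr($1, 2); len = 0; seen = 1; next }
               { len += length($0) }
               END { if (seen) print name "\t" len }' "$fa")
echo "$lengths"

# Sort by length (descending), then walk down until half the total is covered.
n50=$(echo "$lengths" | sort -k2,2nr |
      awk '{ len[NR] = $2; total += $2 }
           END { for (i = 1; i <= NR; i++) {
                     run += len[i]
                     if (run >= total / 2) { print len[i]; exit }
                 } }')
echo "N50: $n50"
```

With these toy contigs the total is 18, so half is 9: the longest contig (8) is not enough, and adding the second (6) passes the threshold.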

VCF

Print all the lines that are variants, excluding the VCF metadata

How many variants have been called?

How many transitions appear across all SNPs?

How many transversions?
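A hedged sketch for the VCF questions on a made-up four-variant file. It relies on the standard VCF layout, where metadata and header lines start with "#", REF is column 4 and ALT is column 5. Only biallelic single-base records are treated as SNPs here; multi-allelic sites such as "G,T" are skipped, which may or may not be what you want.

```shell
# Toy VCF (assumption): 2 transitions, 1 transversion, 1 deletion.
vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$vcf"
printf 'chr1\t10\t.\tA\tG\t50\tPASS\t.\n' >> "$vcf"    # transition
printf 'chr1\t20\t.\tC\tT\t50\tPASS\t.\n' >> "$vcf"    # transition
printf 'chr1\t30\t.\tA\tC\t50\tPASS\t.\n' >> "$vcf"    # transversion
printf 'chr1\t40\t.\tAT\tA\t50\tPASS\t.\n' >> "$vcf"   # deletion, not a SNP

# Variant lines only (everything that does not start with "#").
grep -v '^#' "$vcf"

# Number of variants called.
n_variants=$(grep -vc '^#' "$vcf")

# Transitions are A<->G and C<->T; any other single-base change is a transversion.
ts_tv=$(awk -F'\t' '!/^#/ && length($4) == 1 && length($5) == 1 {
                        pair = $4 $5
                        if (pair ~ /^(AG|GA|CT|TC)$/) ts++
                        else if (pair ~ /^[ACGT][ACGT]$/) tv++
                    }
                    END { print ts + 0, tv + 0 }' "$vcf")
echo "variants: $n_variants, transitions and transversions: $ts_tv"
```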

BED

Count the total number of lines in the file

Get a list of all the unique repeat elements contained in the file

Get the number of unique DNA elements contained in the file

Get the number of lines in the file that are a DNA element

Count the number of lines on scaffold "NC_054069.1"

Count the number of lines on scaffold "NC_054069.1" that are a simple repeat

Print all the satellite elements in the file

Print all the satellite elements in the file that are larger than 1000 basepairs

Get the total number of basepairs covered by all repeat elements

Count the total number of basepairs covered by a satellite element

Calculate the average size of the satellite elements
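Most of the BED tasks boil down to cut, sort -u and awk filters. The sketch below assumes a hypothetical four-column layout (scaffold, start, end, repeat class) and made-up class labels; check with head which column holds the repeat class in the real file and what the labels look like. BED coordinates are 0-based and half-open, so the size of an element is end minus start.

```shell
# Toy BED file (assumption): column 4 holds the repeat class.
bed=$(mktemp)
printf 'NC_054069.1\t100\t250\tSimple_repeat\n' > "$bed"
printf 'NC_054069.1\t400\t1900\tSatellite\n' >> "$bed"
printf 'NC_054070.1\t0\t500\tDNA\n' >> "$bed"
printf 'NC_054070.1\t900\t1400\tSatellite\n' >> "$bed"

# Unique repeat classes.
unique=$(cut -f4 "$bed" | sort -u)

# Lines on one scaffold.
n_scaffold=$(awk -F'\t' '$1 == "NC_054069.1" { n++ } END { print n + 0 }' "$bed")

# Satellite elements larger than 1000 bp.
big_sat=$(awk -F'\t' '$4 == "Satellite" && $3 - $2 > 1000' "$bed")

# Total bp over all elements (overlapping elements would be counted twice).
total_bp=$(awk -F'\t' '{ s += $3 - $2 } END { print s + 0 }' "$bed")

# Mean size of the satellite elements.
sat_mean=$(awk -F'\t' '$4 == "Satellite" { s += $3 - $2; n++ }
                       END { if (n) print s / n }' "$bed")

echo "$unique"
echo "scaffold lines: $n_scaffold, total bp: $total_bp, mean satellite: $sat_mean"
```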

GFF

Count how many lines are in the file

Count how many non-header lines there are in the file

Print a list of the unique feature types in the file

Count the number of genes in the file

Count the number of genes larger than 10,000 basepairs

Count the number of exons in the gene "HYOU1"

Calculate the size of the gene "HYOU1"
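A possible starting point for the GFF tasks, assuming GFF3 columns (feature type in column 3, start and end in columns 4 and 5, attributes in column 9) and assuming the gene name appears as a gene=NAME attribute, as in the toy lines below. The real file may use a different attribute key, so inspect column 9 first. GFF coordinates are 1-based and inclusive, so size is end minus start plus 1.

```shell
# Toy GFF3 file (assumption): one 12,000 bp gene with two exons, one small gene.
gff=$(mktemp)
printf '##gff-version 3\n' > "$gff"
printf 'chr1\ttoy\tgene\t1001\t13000\t.\t+\t.\tID=gene-HYOU1;gene=HYOU1\n' >> "$gff"
printf 'chr1\ttoy\texon\t1001\t1200\t.\t+\t.\tID=exon-1;gene=HYOU1\n' >> "$gff"
printf 'chr1\ttoy\texon\t12000\t13000\t.\t+\t.\tID=exon-2;gene=HYOU1\n' >> "$gff"
printf 'chr1\ttoy\tgene\t20000\t20500\t.\t-\t.\tID=gene-ABC1;gene=ABC1\n' >> "$gff"

# Non-header lines and unique feature types.
n_data=$(grep -vc '^#' "$gff")
types=$(grep -v '^#' "$gff" | cut -f3 | sort -u)

# Genes, and genes larger than 10,000 bp.
n_genes=$(awk -F'\t' '$3 == "gene" { n++ } END { print n + 0 }' "$gff")
n_big=$(awk -F'\t' '$3 == "gene" && $5 - $4 + 1 > 10000 { n++ }
                    END { print n + 0 }' "$gff")

# Exons of HYOU1 and the size of the gene itself.
n_exons=$(awk -F'\t' '$3 == "exon" && $9 ~ /(^|;)gene=HYOU1(;|$)/ { n++ }
                      END { print n + 0 }' "$gff")
gene_size=$(awk -F'\t' '$3 == "gene" && $9 ~ /(^|;)gene=HYOU1(;|$)/ {
                            print $5 - $4 + 1 }' "$gff")

echo "genes: $n_genes, >10 kb: $n_big, HYOU1 exons: $n_exons, HYOU1 size: $gene_size"
```

If a gene has several transcripts, the same exon can be listed once per transcript, so a plain line count may overestimate the number of distinct exons.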

Extra challenge!

Produce a command with as many pipes as possible!

Produce the most unreadable sed-substitute command!

-> submit your commands here (at the end)!
