Inside the folder /home/genomics/workshop_materials/unix, you will find a folder called data. It contains a dataset that includes:
Sequencing read files (fastq), generated by paired-end Illumina sequencing.
Genome assembly file (fasta) containing a few chromosomes.
Genome annotation in gtf and gff formats.
Repeat elements and their locations (bed file).
Variant call file (vcf), describing the variation found in the population.
For each of the files in the folder data, create a symbolic link (symlink) inside the folder working_directory, which is also located in the unix folder.
Now we can easily work with the files without having to copy them. Moving or renaming a link with the command mv also leaves the original file in data untouched.
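One possible way to create the links is a short loop around ln -s. The sketch below builds a stand-in folder layout in a temporary directory, so the file names reads_1.fastq and genome.fasta are hypothetical; on the workshop machine the same loop would point at the real data folder instead.

```shell
# Hypothetical stand-in for the workshop layout, built in a temporary directory
base=$(mktemp -d)
mkdir -p "$base/unix/data" "$base/unix/working_directory"
touch "$base/unix/data/reads_1.fastq" "$base/unix/data/genome.fasta"

# Link every file in data into working_directory, using absolute paths as targets
for f in "$base"/unix/data/*; do
    ln -s "$f" "$base/unix/working_directory/"
done

# Symlinks show up with an arrow pointing at the original file
ls -l "$base/unix/working_directory"
n_links=$(find "$base/unix/working_directory" -type l | wc -l | tr -d ' ')
echo "links created: $n_links"
```

Absolute targets are used here because a relative target is resolved from the location of the link, not from the directory where ln was run.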
How many reads do the fastq files contain?
Calculate the average length of the reads.
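A minimal sketch for both fastq questions, run on a two-read toy file rather than the workshop data. It assumes the common layout of exactly four lines per read (header, sequence, "+", qualities) and an uncompressed file; neither is stated in the exercise, so check the real files first.

```shell
# Toy fastq file with two reads (hypothetical data)
fq=$(mktemp)
cat > "$fq" <<'EOF'
@read1
ACGTAC
+
IIIIII
@read2
ACGT
+
IIII
EOF

# Each read occupies four lines, so reads = lines / 4
n_reads=$(( $(wc -l < "$fq") / 4 ))

# The sequence is the second line of every four-line record
mean_len=$(awk 'NR % 4 == 2 { total += length($0); n++ } END { if (n) print total / n }' "$fq")

echo "reads: $n_reads, mean length: $mean_len"
```

Counting lines that start with "@" is tempting but can overcount, because "@" is also a valid quality character and may begin a quality line.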
Simplify fasta headers
Modify the fasta header so that it does not contain spaces
Modify the fasta header to a shorter identifier (no more than 5 characters), while keeping it informative.
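Both header tasks can be approached with sed substitutions restricted to lines starting with ">". The toy headers below are hypothetical, and the second command assumes that each header contains the word "chromosome" followed by a label; the real assembly may need a different pattern.

```shell
# Toy fasta file with long, space-containing headers (hypothetical data)
fa=$(mktemp)
cat > "$fa" <<'EOF'
>NC_000001.1 Example species chromosome 1, whole genome shotgun sequence
ACGTACGTAC
>NC_000002.1 Example species chromosome 2, whole genome shotgun sequence
ACGTAC
EOF

# Replace every space with an underscore, on header lines only
sed '/^>/ s/ /_/g' "$fa" > "$fa.nospace"

# Shorten each header to "chr" plus the chromosome label (assumed naming scheme)
sed -E '/^>/ s/^>.*chromosome ([0-9A-Za-z]+).*/>chr\1/' "$fa" > "$fa.short"

grep '^>' "$fa.nospace" "$fa.short"
```

Writing to new files instead of editing in place keeps the originals intact, which matters when the input is a symlink to shared data.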
How long is each of the contigs?
Can you calculate the N50 of the genome assembly?
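A sketch of one way to get contig lengths and the N50 with awk, shown on a hypothetical three-contig file. N50 is taken here as the length of the contig at which the running sum of lengths, sorted from longest to shortest, first reaches half of the total assembly size.

```shell
# Toy assembly: chr1 = 20 bp (wrapped over two lines), chr2 = 10 bp, chr3 = 5 bp
fa=$(mktemp)
cat > "$fa" <<'EOF'
>chr1
ACGTACGTAC
ACGTACGTAC
>chr2
ACGTACGTAC
>chr3
ACGTA
EOF

# Sum the sequence lines under each header: one "name<TAB>length" row per contig
lengths=$(awk '/^>/ { if (name) print name "\t" len; name = substr($1, 2); len = 0; next }
               { len += length($0) }
               END { if (name) print name "\t" len }' "$fa")
printf '%s\n' "$lengths"

# Sort lengths from longest to shortest and stop once half the total is reached
n50=$(printf '%s\n' "$lengths" | cut -f2 | sort -rn |
      awk '{ len[NR] = $1; total += $1 }
           END { for (i = 1; i <= NR; i++) { run += len[i]; if (run >= total / 2) { print len[i]; exit } } }')
echo "N50: $n50"
```

In the toy case the total is 35 bp, so the 20 bp contig alone already covers half of the assembly.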
Print all the lines that are variants, excluding the VCF metadata
How many variants have been called?
How many transitions appear across all SNPs?
How many transversions?
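The VCF questions can be sketched with grep and awk on a small hypothetical file. Transitions are purine-to-purine or pyrimidine-to-pyrimidine changes (A<->G, C<->T); every other single-base change is a transversion. This sketch only classifies records with one single-base REF and one single-base ALT, so indels and multi-allelic sites are ignored, which is an assumption rather than part of the exercise.

```shell
# Toy VCF: five SNPs and one deletion (hypothetical data)
vcf=$(mktemp)
printf '%s\n' '##fileformat=VCFv4.2' > "$vcf"
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    '#CHROM' POS ID REF ALT QUAL FILTER INFO \
    chr1 100 . A G 50 PASS . \
    chr1 200 . C T 50 PASS . \
    chr1 300 . A C 50 PASS . \
    chr1 400 . G T 50 PASS . \
    chr1 500 . T C 50 PASS . \
    chr1 600 . AT A 50 PASS . >> "$vcf"

# Variant records are all lines that do not start with "#"
grep -v '^#' "$vcf"
n_variants=$(grep -vc '^#' "$vcf")

# REF and ALT are columns 4 and 5; join them and match the substitution type
ts=$(awk -F'\t' '!/^#/ && ($4 $5) ~ /^(AG|GA|CT|TC)$/ { n++ } END { print n + 0 }' "$vcf")
tv=$(awk -F'\t' '!/^#/ && ($4 $5) ~ /^(AC|CA|AT|TA|CG|GC|GT|TG)$/ { n++ } END { print n + 0 }' "$vcf")

echo "variants: $n_variants, transitions: $ts, transversions: $tv"
```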
Count the total number of lines in the repeat elements (bed) file
Get a list of all the unique repeat elements contained in the file
Get the number of unique DNA elements contained in the file
Get the number of lines in the file that are a DNA element
Count the number of lines on scaffold "NC_054069.1"
Count the number of lines on scaffold "NC_054069.1" that are a simple repeat
Print all the satellite elements in the file
Print all the satellite elements in the file that are larger than 1000 basepairs
Get the total number of basepairs covered by all repeat elements
Count the total number of basepairs covered by a satellite element
Calculate the average size of the satellite elements
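A sketch covering several of the repeat-file questions with cut, sort and awk. The column layout is an assumption: scaffold, start, end, element name (column 4) and element class (column 5). The class labels "DNA", "Simple_repeat" and "Satellite", the scaffold NC_054070.1 and all rows are hypothetical, so inspect the real file with head before adapting the column numbers.

```shell
# Toy bed file with six repeat elements (hypothetical data and column layout)
bed=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    NC_054069.1 0 1500 SAT1 Satellite \
    NC_054069.1 2000 2030 '(AT)n' Simple_repeat \
    NC_054069.1 3000 3400 hAT-1 DNA \
    NC_054070.1 100 600 SAT2 Satellite \
    NC_054070.1 1000 1200 hAT-1 DNA \
    NC_054070.1 5000 5020 '(CA)n' Simple_repeat > "$bed"

# Unique element names, and number of lines whose class is DNA
n_unique=$(cut -f4 "$bed" | sort -u | wc -l | tr -d ' ')
n_dna_lines=$(awk -F'\t' '$5 == "DNA"' "$bed" | wc -l | tr -d ' ')

# Simple repeats on one scaffold
n_simple=$(awk -F'\t' '$1 == "NC_054069.1" && $5 == "Simple_repeat"' "$bed" | wc -l | tr -d ' ')

# bed intervals are 0-based and half-open, so size = end - start
big_sat=$(awk -F'\t' '$5 == "Satellite" && $3 - $2 > 1000' "$bed")
total_bp=$(awk -F'\t' '{ bp += $3 - $2 } END { print bp + 0 }' "$bed")
sat_mean=$(awk -F'\t' '$5 == "Satellite" { bp += $3 - $2; n++ } END { if (n) print bp / n }' "$bed")

echo "unique: $n_unique, DNA lines: $n_dna_lines, simple repeats on scaffold: $n_simple"
echo "total bp: $total_bp, mean satellite size: $sat_mean"
printf '%s\n' "$big_sat"
```

Summing end - start counts overlapping elements twice; if the real file contains overlaps, the total is an upper bound on the bases actually covered.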
Count how many lines are in the annotation (gtf or gff) file
Count how many non-header lines there are in the file
Print a list of the unique feature types in the file
Count the number of genes in the file
Count the number of genes larger than 10,000 basepairs
Count the number of exons in the gene "HYOU1"
Calculate the size of the gene "HYOU1"
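The annotation questions follow the same pattern, sketched here on a hypothetical gtf file. The feature type is column 3, start and end are columns 4 and 5, and the attributes are column 9. The sketch assumes the gene name is stored as gene_name "HYOU1" in the attributes; the real file may use a different key. Coordinates in gtf and gff are 1-based and inclusive, so size = end - start + 1.

```shell
# Toy gtf: one header line, two genes, four exons (hypothetical data)
gtf=$(mktemp)
printf '%s\n' '#!genome-build toy' > "$gtf"
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    chr1 toy gene 1000 15000 . + . 'gene_id "G1"; gene_name "HYOU1";' \
    chr1 toy exon 1000 1200 . + . 'gene_id "G1"; gene_name "HYOU1";' \
    chr1 toy exon 5000 5300 . + . 'gene_id "G1"; gene_name "HYOU1";' \
    chr1 toy exon 14000 15000 . + . 'gene_id "G1"; gene_name "HYOU1";' \
    chr1 toy gene 20000 22000 . - . 'gene_id "G2"; gene_name "GENE2";' \
    chr1 toy exon 20000 22000 . - . 'gene_id "G2"; gene_name "GENE2";' >> "$gtf"

# Lines in total, and lines that are not "#" headers
n_lines=$(wc -l < "$gtf" | tr -d ' ')
n_records=$(grep -vc '^#' "$gtf")

# Unique feature types (column 3)
features=$(grep -v '^#' "$gtf" | cut -f3 | sort -u)

# Genes, and genes longer than 10,000 bp
n_genes=$(awk -F'\t' '$3 == "gene"' "$gtf" | wc -l | tr -d ' ')
n_big=$(awk -F'\t' '$3 == "gene" && $5 - $4 + 1 > 10000' "$gtf" | wc -l | tr -d ' ')

# Exon count and gene size for HYOU1
n_exons=$(awk -F'\t' '$3 == "exon" && $9 ~ /gene_name "HYOU1"/' "$gtf" | wc -l | tr -d ' ')
gene_size=$(awk -F'\t' '$3 == "gene" && $9 ~ /gene_name "HYOU1"/ { print $5 - $4 + 1 }' "$gtf")

echo "lines: $n_lines, records: $n_records, genes: $n_genes, genes > 10 kb: $n_big"
echo "HYOU1 exons: $n_exons, HYOU1 size: $gene_size"
```

If a gene has several transcripts, the same exon can appear once per transcript, so a plain line count may exceed the number of distinct exons.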
Produce a command with the largest number of pipes!
Produce the most unreadable sed-substitute command!
-> submit your commands here (at the end)!