Genomics

Genomics focuses on the structure, function, evolution, mapping, and editing of genome - an organism's complete set of DNA including all of its genes. The DNA contains the fundamental information in a form of a long molecular string made up of four different chemical units called nucleotide (Nitrogenous bases) Adenine (A), Thymine (T), Cytosine (C) and Guanine (G) in the rungs of DNA twisted ladder (double helix) as showed in Fig. 1 bonded to each other by hydrogen bonds. This unique sequence of letters encodes information about the control of the molecular processes in an organism including protein synthesis by the organism.

Fig.1: Histons, alkaline proteins, package and order DNA into structural unit called Nucleosomes which plays the role in gene expression and Regulation [1].

Human Genome contains 23 pairs of chromosomes packed into the nucleus of human cell: 23 from each parent and 23rd pair is the sex chromosome. Each human cell in the body contains a complete copy of approximately 3 billion DNA base pairs which enables a one-cell embryo to develop into a 100-trillion-cell human adult. Gene is the sub-unit of DNA that contains particular sets of instructions for coding particular protein or functionality. Each of estimated 20,00 to 25,000 protein-coding-genes in the human genome codes for an average of three proteins. The average protein-coding gene is about 3,000 letters long with the shortest and longest ones comprising of 500 and 2.3 million letters respectively. The cell uses two-step process of gene expression (transcription and translation) to read each gene and produce the string of amino acids that makes up a protein as showed in Fig. 2 and Fig. 3.

Image result for transcribe, utah, genetics

Fig. 2: Two-step process of transcription and translation to read each gene and produce the string of amino acids that makes up a protein [2]

File:Gene structure eukaryote 2 annotated.svg

Fig. 3: Regulatory sequence controls when and where gene expression occurs for protein coding region (red). Promoter and enhancer regions (yellow) regulate the transcription of the gene into a pre-mRNA which is modified to remove introns (light grey) and add a 5' cap and poly-A tail (dark grey). The mRNA 5' and 3' untranslated regions (blue) regulate translation into the final protein product [3].

The protein coding genes only account for about 1.5% of genome's total nucleotides. A small chunk of the genome contains non-protein-coding genes which code for RNA products such as tRNA (transfer RNA) and rRNA (ribosomal RNA) But the bulk of the genome doesn't code but have been found to be associated with biochemical activities such as gene regulation, organization of chromosome architecture, and signal controlling epigentic inheritence. If the cell's DNA is mutated (i.e. change in the sequence), an abnormal protein may be produced or genes' expressions got uncontrolled which can disrupt the body's usual function and can lead to a disease such as cancer.

The information of the same gene can create different mRNA (messenger RNA) known as alternative splicing which allows the cells to make many proteins than we have genes.

DNA Vs RNA Sequencing

Sequencing is the process of determining the precise order of nucleotides within RNA and DNA molecule. Since DNA molecules are small and nucleotide is spaced about 3x10-10m apart, it is almost impossible to identify them with the highest-powered microscope existing now. Frderick Sanger adopted the primer extension strategy by Ray to develop Chain-Termination methods for DNA Sequencing in 1977 where as Allan Maxam independently develop DNA sequencing by chemical degradation in 1973. Sanger's DNA sequencing method prevailed till mid-2000s with advancement in fluorescent labeling, capillary electrophoresis, and general automation with the production of the first human genome in 2001.

The high demand for low-cost sequencing has driven the development of Next-Generation Sequencing (NGS) (Fig. 4 & Fig. 5), a non-Sanger-based high-throughput sequencing technology which parallelizes the sequencing process producing thousands or millions of sequences or fragments (short reads) concurrently. The multiple fragmented sequence reads need to be assembled together on the basis of the overlapping areas or aligned to the reference Genome (RNA-alignment).

Aligning or Mapping reads to a genome can be viewed in the general context of approximate string matching. The goal is to find a pattern; a short-read esp in text-based FASTA or FASTQ format that represents either nucleotide sequences or peptide sequences, in a corpus or genome (reference), allowing for mismatches and indels (inserted/deleted bases) in the genome of an organism. Indexing is the technique to pre-process or index the reference to make queries fast and also that can even compress the size of the text. At the heart of Bowtie is the Burrows-Wheeler transform, which is used to compress the size of the genome and also make searching the genome faster.

Fig. 4: DNA Sequencing [1]

Fig. 5: Next-Generation DNA Sequencing [4]

In the most common type of DNA sequencing, Sequencing by Synthesis, the nucelotides are chemically tagged with a fluorescent label. The sequence information can be used to determine which stretches of DNA contain genes and which stretches carry regulatory instructions to determine genetic variations/mutations or disease causing change (substitution, deletion, or addition) of base pairs.

The Human Genome Project led by NIH has produced a very high-quality version of the human genome sequence: a generic sequence derived from several individuals, that is freely available in public database that can be used for a broad range of biomedical studies to search for genetic variations and/or mutations that may play a role in the development or progression of disease such as cancer. For an example, to look for the genetic variations that increase the risk of specific disease or look for type of genetic mutations frequently seen in cancerous cells.

RNA sequencing (RNA-Seq) is the sequencing of the complementary DNA (cDNA) that uses NGS that provides a more detailed and quantitative overtime view of gene expression, alternative splicing, and allele-specific expression in contrast to DNA-Seq's static picture of what a cell or organism might do or become. RNA-seq has enabled deep profiling of the transcriptome (the set of all RNA molecules in one cell or a population of cells) and analysis of different physiological and pathological conditions. Hence, RNA-Seq is also called Whole Transcriptome Shotgun Sequencing(WTSS).

Using Software @ HPC

Case Study: Single Nucleotide Variation (SNV) - [4]

SNV is the difference at one base or nucelotide e.g. ATG -> CTG.

Request a compute node

srun --x11 --pty bash

Copy the directory GENOMICS from /usr/local/doc to your home directory.

cp -r /usr/local/doc/GENOMICS .

There are two files NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_1.fastqsanger, NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_2.fastqsanger, one million 100bp reads produced on an Illumina HiSeq2000(30 x coverage,paired-end) and the fasta file for human chromosome 20 from Human Genome version 19 (hg19).

Unzip the human chromosome 20

gunzip chr20.fa.gz

Load the Bowtie, tophat, samtools, bcftools, and circos modules:

module load bowtie

module load tophat

module load samtools

module load bcftools

module load circos

Build an index of the human chromosome 20 usingBowtie 2. It will generate several files (chr20.*.bt2 - BowTie2) with FM-index (a compressed full-text substring index based on the Burrows-Wheeler transform) of human chromosome 20.

bowtie2-build chr20.fa chr20

output:

Settings:

Output files: "chr20.*.bt2"

Line rate: 6 (line is 64 bytes)

Lines per side: 1 (side is 64 bytes)

Offset rate: 4 (one in 16)

FTable chars: 10

Strings: unpacked

Max bucket size: default

Max bucket size, sqrt multiplier: default

Max bucket size, len divisor: 4

Difference-cover sample period: 1024

Endianness: little

Actual local endianness: little

Sanity checking: disabled

Assertions: disabled

Random seed: 0

Sizeofs: void*:8, int:4, long:8, size_t:8

Input files DNA, FASTA:

chr20.fa

Building a SMALL index

...

Align or map the reads (NA12878*) to the reference genome (chr20.fa) using TopHat

tophat chr20 NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_1.fastqsanger NA12878.hiseq.wgs_chr20_2mb.30xPE.fastq_2.fastqsanger

output:

[2017-09-11 12:24:58] Beginning TopHat run (v2.1.0)

-----------------------------------------------

[2017-09-11 12:24:59] Checking for Bowtie

Bowtie version: 2.2.6.0

[2017-09-11 12:24:59] Checking for Bowtie index files (genome)..

[2017-09-11 12:24:59] Checking for reference FASTA file

[2017-09-11 12:24:59] Generating SAM header for chr20

[2017-09-11 12:24:59] Preparing reads

left reads: min. length=101, max. length=101, 242814 kept reads (1294 discarded)

right reads: min. length=101, max. length=101, 241083 kept reads (3025 discarded)

[2017-09-11 12:25:12] Mapping left_kept_reads to genome chr20 with Bowtie2

....

[2017-09-11 12:34:05] A summary of the alignment counts can be found in ./tophat_out/align_summary.txt

[2017-09-11 12:34:05] Run complete: 00:09:06 elapsed

Find the files (.bam,.bed) generated at ./tophat_out directory:

- accepted_hits.bam—This BAM-format file details the mapping of each read to the genome. The accepted_hits.bam file is used for further analysis in quantification of transcripts within the sample.
- junctions.bed —This file contains information, including genomic location and supporting evidence, about all of the splice junctions discovered by TopHat in the process of aligning reads.

Obtain sample statistics from .bam file

samtools flagstat *.bam

output:

142 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 secondary

0 + 0 supplementary

0 + 0 duplicates

142 + 0 mapped (100.00%:-nan%)

Change directory to tophat_out

cd tophat_out

Use samtools sort to convert the BAM file to a sorted BAM file

samtools sort accepted_hits.bam chr20MappingBowtie.sorted

Sorted BAM is a useful format because the alignments are both compressed and sorted, which is convenient for long-term storage and for variant discovery respectively.

You can also use samtools to create an index of the sorted mapping

samtools index chr20MappingBowtie.sorted.bam

There is an index file .bai (i=> index) created in that directory.

View the mapping via IGV (Interactive Genome Viewer). Open the IGV GUI

java -Xmx750m -jar /home/sxg125/Software/genomics/tophat_out/IGV_2.3.97/igv.jar &

Load the sorted BAM file - File -> Load File From ... and browse to chr20MappingBowtie.sorted.bam.

Follow the instructions in the attached file (07_lab.pdf) and you will see different regions as showed in Fig. 6.

Fig 6: Viewing Mapping in IGV

Generate variant calls using mpileup in samtools

samtools mpileup -g -f ../chr20.fa chr20MappingBowtie.sorted.bam > chr20SAMvariants.bcf

output:

[mpileup] 1 samples in 1 input files

<mpileup> Set max per-file depth to 8000

...

The mpileup command automatically scans every position supported by an aligned read, computes all the possible genotypes supported by these reads, and then computes the probability that each of these genotypes is truly present in our sample [4].

Use the BCFTools for the actual variant calling - process by which variants are identified from sequence data.

bcftools call -c -v -O v chr20SAMvariants.bcf -o chr20BCFTOOLSvariants.vcf

Check the status of the variants (SNP, indels) in the .vcf file created with bcftools call

bcftools stats chr20BCFTOOLSvariants.vcf

output:

# This file was produced by bcftools stats (1.3+htslib-1.3) and can be plotted using plot-vcfstats.

# The command line was: bcftools stats chr20BCFTOOLSvariants.vcf

# Definition of sets:

# ID [2]id [3]tab-separated file names

ID 0 chr20BCFTOOLSvariants.vcf

# SN, Summary numbers:

# SN [2]id [3]key [4]value

SN 0 number of samples: 1

SN 0 number of records: 327

SN 0 number of no-ALTs: 0

SN 0 number of SNPs: 316

SN 0 number of MNPs: 0

SN 0 number of indels: 11

SN 0 number of others: 0

Dump the stats on the plot file to be plotted by plot-vcfstats

bcftools stats chr20BCFTOOLSvariants.vcf > plot

Plot the PNG files

plot-vcfstats -P -p ./ plot

output:

Parsing bcftools stats output: plot

Plotting graphs: python ./plot.py

it will create several several files including png files

Make a montage of 4 png files using ImageMagick montage command

montage -mode concatenate *.png -tile 2x2 variant.png

View in ImageMagick for variants as showed in Fig. 7

display variant.png

Fig. 7: Variants plot

Convert the VCF file (chr20BCFTOOLSvariants.vcf ) to data file (vcf.dat).

awk '/^#/ {next} {printf("%s\t%d\n",$1,$2-$2%10000);}' chr20BCFTOOLSvariants.vcf | sort | uniq -c | awk '{printf("hs%s\t%s\t%d\t%s\n",$2,$3,$3+10000,$1);}' > vcf.dat

This vcf.dat has been called at vcf.conf file (Circos configuration file [5] )that is available in the GENOMICS directory.

# The type sets the format of the track.

type = histogram

file = vcf.dat

Execute circos using vcf.dat

circos -outputdir ./ -outputfile vcf -conf vcf.conf

View the circos image as showed in Fig. 8

display vcf.png