Bioinformatics Pipeline

Bioinformatics pipeline for Comparative Transcriptomics Analysis: The pipeline starts with learning to use High Performance Computing (HPC), which at NCSU is provided by the Henry2 cluster, and a command line interface such as the Unix-based Terminal on macOS. Once trained, the user navigates in the Terminal to the bitcpt class directory and makes a working directory with sub directories. The RNAseq data obtained from a high-throughput sequencing platform, as well as the scripts for running the relevant software on the HPC, can then be copied into the relevant sub directories (e.g., Portfolio).
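The working directory layout can be sketched in Python as below. This is a minimal illustration only: the sub directory names rawData, scripts and fastqc_out are hypothetical, the other names are taken from the output directories mentioned on this page, and the actual class layout may differ.

```python
from pathlib import Path

# Assumed sub directory names. rawData, scripts and fastqc_out are
# hypothetical; the rest are the output directories named on this page.
SUBDIRS = [
    "rawData",
    "scripts",
    "fastqc_out",
    "starindices",
    "AlignedToTranscriptome",
    "salmon_align_quant",
]


def make_workdir(base, name, subdirs=SUBDIRS):
    """Create base/name with its sub directories and return the working directory."""
    workdir = Path(base) / name
    for sub in subdirs:
        # parents=True also creates the working directory itself;
        # exist_ok=True makes the call safe to repeat.
        (workdir / sub).mkdir(parents=True, exist_ok=True)
    return workdir
```

Calling `make_workdir(".", "my_workdir")` (a hypothetical name) from inside the class directory would create the whole tree in one step.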

The RNAseq data is then checked for quality with FastQC, using an edited script that asks the HPC-maintained conda software to activate the user-maintained FastQC software. The output is an HTML report, which can then be downloaded to a personal computer using globus.org. Globus is a research data management service that allows moving and sharing data through a single graphical user interface.
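As a rough sketch of the FastQC step, the command line that such a script ends up running can be assembled as follows. The file and directory names are hypothetical, and the snippet only builds the command. It does not load conda or submit a job, which the class scripts handle on Henry2.

```python
import shlex


def fastqc_command(fastq_files, out_dir, threads=1):
    """Return the argument list for a single FastQC run."""
    return [
        "fastqc",
        "--outdir", str(out_dir),    # where the HTML and zip reports are written
        "--threads", str(threads),   # number of files processed in parallel
        *[str(f) for f in fastq_files],
    ]


# Hypothetical sample and output directory names.
cmd = fastqc_command(["leaf1_R1.fastq.gz", "leaf1_R2.fastq.gz"], "fastqc_out", threads=2)
print(shlex.join(cmd))
```

FastQC writes one HTML report per input file into the output directory, and these are the files transferred with Globus.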

Next, the reference genome of the diploid potato landrace, already downloaded into the referenceGenome directory of the class directory, is indexed using the STAR index tool, and the output is generated in the starindices directory. Genome assemblies are usually created by assembling overlapping reads into contiguous sequences called contigs, which are then ordered and oriented into larger fragments called scaffolds. The scaffolds are further ordered and oriented to form the chromosome representations.
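The indexing step corresponds to STAR's genomeGenerate run mode. The sketch below assembles such a command. The FASTA and GTF file names, the thread count and the overhang value are assumptions for illustration, not the values used in the class script.

```python
def star_index_command(genome_fasta, annotation_gtf, index_dir,
                       threads=4, overhang=100):
    """Return the argument list for building a STAR genome index."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", str(index_dir),            # output directory for the index
        "--genomeFastaFiles", str(genome_fasta),  # reference genome sequence
        "--sjdbGTFfile", str(annotation_gtf),     # gene annotation for splice junctions
        "--sjdbOverhang", str(overhang),          # ideally read length minus 1
    ]


# Hypothetical file names inside the referenceGenome directory.
cmd = star_index_command(
    "referenceGenome/potato.fasta",
    "referenceGenome/potato.gtf",
    "starindices",
)
```

Supplying the annotation at this stage matters for the later steps, because STAR needs transcript annotations to produce alignments in transcriptome coordinates.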

The quality-checked data from FastQC is then aligned to the indexed diploid potato sequence using the STAR align tool. The alignment script is obtained by copying the script used for Arabidopsis in the class directory and editing it for the tomato data. The relevant output is the Aligned.toTranscriptome.out.bam file generated in the AlignedToTranscriptome directory.
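A minimal sketch of the alignment command is shown below. The `--quantMode TranscriptomeSAM` option is what makes STAR write the Aligned.toTranscriptome.out.bam file. The sample names, the output prefix and the assumption that the reads are gzipped are illustrative only.

```python
def star_align_command(index_dir, reads, out_prefix, threads=4, gzipped=True):
    """Return the argument list for aligning one sample with STAR."""
    cmd = [
        "STAR",
        "--runMode", "alignReads",
        "--runThreadN", str(threads),
        "--genomeDir", str(index_dir),
        "--readFilesIn", *[str(r) for r in reads],  # one file, or R1 and R2 for paired-end
        "--quantMode", "TranscriptomeSAM",          # also report alignments in transcript coordinates
        "--outSAMtype", "BAM", "Unsorted",
        "--outFileNamePrefix", str(out_prefix),
    ]
    if gzipped:
        # Assumes zcat is available to decompress the reads on the fly.
        cmd += ["--readFilesCommand", "zcat"]
    return cmd


def transcriptome_bam(out_prefix):
    """Path of the transcriptome BAM that STAR writes for a given prefix."""
    return f"{out_prefix}Aligned.toTranscriptome.out.bam"


# Hypothetical sample name used as part of the output prefix.
prefix = "AlignedToTranscriptome/leaf1_"
cmd = star_align_command("starindices", ["leaf1_R1.fastq.gz", "leaf1_R2.fastq.gz"], prefix)
```

Using a per-sample prefix keeps the output of each sample separate inside the AlignedToTranscriptome directory.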

The aligned sequences are then quantified and normalized using the Salmon tool. Again, the scripts used for Arabidopsis are copied into the relevant directory and edited for tomato. Salmon uses the genome-wide transcriptome file already downloaded in the class directory. The output is generated in the salmon_align_quant directory as quant.sf files, one for each sample.
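The sketch below shows two things under stated assumptions: how a Salmon alignment-based quantification command might be assembled, and how the standard transcripts per million (TPM) normalization relates read counts to effective transcript lengths. The file names are hypothetical, and the `tpm` function only illustrates the TPM definition. It is not Salmon's internal estimation procedure.

```python
def salmon_quant_command(transcriptome_fasta, bam, out_dir, threads=4):
    """Return the argument list for Salmon in alignment-based mode."""
    return [
        "salmon", "quant",
        "-t", str(transcriptome_fasta),  # transcript sequences the reads were aligned to
        "-l", "A",                       # let Salmon infer the library type
        "-a", str(bam),                  # transcriptome-coordinate alignments from STAR
        "-o", str(out_dir),              # directory that will contain quant.sf
        "-p", str(threads),
    ]


def tpm(num_reads, effective_lengths):
    """Convert per-transcript read counts to TPM using effective lengths."""
    rates = [reads / length for reads, length in zip(num_reads, effective_lengths)]
    total = sum(rates)
    return [1e6 * rate / total for rate in rates]


# Toy numbers: the longer transcripts have more reads, but after dividing by
# length the second and third transcripts turn out to be equally abundant.
values = tpm([100, 300, 600], [1000, 1500, 3000])
```

TPM values within a sample always sum to one million, which is what makes them comparable across transcripts of different lengths.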

Finally, the quant.sf files are transferred to a personal computer using globus.org and then uploaded to usegalaxy.org for differential expression analysis. The DESeq2 tool on usegalaxy.org generates five plots: a PCA plot, sample-to-sample distances, dispersion estimates, a histogram of p-values, and an MA plot. Alternatively, usegalaxy.org can also be used to run the edgeR tool for differential expression analysis.
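Before uploading, it can be useful to inspect the quant.sf files locally. The sketch below reads the tab-separated quant.sf format (columns Name, Length, EffectiveLength, TPM, NumReads) and merges one column across samples. It assumes each sample has its own sub directory containing a quant.sf file, which may not match the actual layout of salmon_align_quant.

```python
import csv
from pathlib import Path


def read_quant(path, column="NumReads"):
    """Return {transcript name: value} for one column of a quant.sf file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["Name"]: float(row[column]) for row in reader}


def merge_quants(quant_dir, column="NumReads"):
    """Return {transcript: {sample: value}} for every <sample>/quant.sf found."""
    table = {}
    for quant_file in sorted(Path(quant_dir).glob("*/quant.sf")):
        sample = quant_file.parent.name  # sample name taken from the directory name
        for name, value in read_quant(quant_file, column).items():
            table.setdefault(name, {})[sample] = value
    return table
```

For example, `merge_quants("salmon_align_quant", column="TPM")` would collect the TPM values of all samples into one table, which gives a quick check that every sample was quantified against the same set of transcripts.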

**See the Bioinformatics Codes subpages for the relevant scripts for using FastQC, STAR, and Salmon on Henry2.

Figure 1: Graphical Representation of the RNAseq bioinformatics pipeline followed to perform Comparative Transcriptomic Analysis of the Tomato Leaf and Meristem samples referenced to the Diploid Potato.
