What changes in soybean gene expression do we uncover when aligning reads to different reference genomes?
RNA was isolated from young and old leaf samples, as well as meristem samples, and used to prepare sequencing libraries.
Next-generation sequencing (Illumina) was used to generate paired-end reads.
Figure 1: Graphical abstract of the RNA sequencing workflow for Glycine max young and old leaf samples, as well as meristem samples, aligned to the indexed Glycine_max_Lee_v1 reference genome.
Set up a working directory and QC data
We first set up a working directory on the HPC system at NCSU. FastQC was then run on the raw Glycine max sequence data from the Linux command line to check the quality of the sequence reads before continuing.
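The quality-control step can be sketched as a small Python helper that assembles a FastQC command line. This is only an illustration, not the script used in the project: the file names, output directory, and thread count are hypothetical placeholders, and nothing is executed here.

```python
def fastqc_command(fastq_files, out_dir, threads=2):
    """Build the argument list for a FastQC run (the command is not executed)."""
    return [
        "fastqc",
        "--outdir", str(out_dir),    # directory where the HTML/zip reports are written
        "--threads", str(threads),   # number of files processed in parallel
        *[str(f) for f in fastq_files],
    ]


# Hypothetical paired-end files for one sample (names are placeholders).
qc_cmd = fastqc_command(
    ["young_leaf_R1.fastq.gz", "young_leaf_R2.fastq.gz"],
    out_dir="fastqc_results",
)
print(" ".join(qc_cmd))
```

On a system where FastQC is installed, the list could be passed to `subprocess.run(qc_cmd, check=True)` once the output directory exists.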
Build an indexed reference genome to align your sequences
The Glycine_max_Lee_v1 reference genome was indexed so that the raw sequence reads could be aligned to it. The index was built with STAR, run from a script on the HPC.
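A minimal sketch of how the indexing step might be assembled is shown below. It uses STAR's `genomeGenerate` run mode. The FASTA and annotation file names, the index directory, and the thread count are assumptions made for illustration, and whether an annotation file was supplied in the original run is not stated.

```python
def star_index_command(genome_fasta, genome_dir, threads=4, gtf=None, sjdb_overhang=None):
    """Build the argument list for STAR genome indexing (the command is not executed)."""
    cmd = [
        "STAR",
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,            # where the index files are written
        "--genomeFastaFiles", genome_fasta,   # reference genome sequence
    ]
    if gtf is not None:
        # Optional gene annotation, used by STAR to record known splice junctions.
        cmd += ["--sjdbGTFfile", gtf]
        if sjdb_overhang is not None:
            # Commonly set to read length minus one.
            cmd += ["--sjdbOverhang", str(sjdb_overhang)]
    return cmd


# Hypothetical file and directory names for the Glycine_max_Lee_v1 reference.
index_cmd = star_index_command("Glycine_max_Lee_v1.fa", "star_index_Lee_v1")
print(" ".join(index_cmd))
```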
Align your sequences to the genome
A script was written to align the raw Glycine max sequence reads to the indexed reference genome using STAR. The alignment outputs a BAM file that is then used for quantification.
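The alignment step could look roughly like the sketch below, which builds a STAR command for one paired-end sample. The sample file names, index directory, and output prefix are hypothetical. The `transcriptome_bam` switch is an assumption added here because Salmon's alignment-based mode expects reads aligned to transcripts, which STAR can emit with `--quantMode TranscriptomeSAM`; the original script may have been set up differently.

```python
def star_align_command(read1, read2, genome_dir, out_prefix,
                       threads=4, gzipped=True, transcriptome_bam=False):
    """Build the argument list for a paired-end STAR alignment (not executed)."""
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,                   # index built in the previous step
        "--readFilesIn", read1, read2,               # mate 1 and mate 2 of the pair
        "--outSAMtype", "BAM", "SortedByCoordinate", # write a sorted BAM file
        "--outFileNamePrefix", out_prefix,
    ]
    if gzipped:
        # Decompress .gz FASTQ files on the fly.
        cmd += ["--readFilesCommand", "zcat"]
    if transcriptome_bam:
        # Also write alignments in transcript coordinates for downstream quantification.
        cmd += ["--quantMode", "TranscriptomeSAM"]
    return cmd


# Hypothetical sample and directory names.
align_cmd = star_align_command(
    "young_leaf_R1.fastq.gz", "young_leaf_R2.fastq.gz",
    genome_dir="star_index_Lee_v1", out_prefix="young_leaf_",
    transcriptome_bam=True,
)
print(" ".join(align_cmd))
```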
Quantify the aligned sequences into counts to be analyzed downstream
Salmon was run from a script to quantify the alignments. The resulting quantification files report the expression levels in the data. They can be viewed as text files or converted to Excel spreadsheets.
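As a rough illustration of this step, the sketch below builds a Salmon alignment-mode command and reads a `quant.sf` table, the tab-separated text file Salmon writes with the columns Name, Length, EffectiveLength, TPM, and NumReads. The file names, transcript IDs, and numbers are made-up placeholders, not project data, and the automatic library-type setting (`-l A`) is an assumption.

```python
import csv
import io


def salmon_quant_command(transcripts_fasta, bam_file, out_dir, threads=4):
    """Build the argument list for alignment-based Salmon quantification (not executed)."""
    return [
        "salmon", "quant",
        "-t", transcripts_fasta,  # transcript sequences the reads were aligned to
        "-l", "A",                # let Salmon infer the library type (assumption)
        "-a", bam_file,           # alignments in transcript coordinates
        "-o", out_dir,
        "-p", str(threads),
    ]


def read_quant_sf(handle):
    """Parse a Salmon quant.sf table into {name: {"tpm": ..., "num_reads": ...}}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["Name"]: {"tpm": float(row["TPM"]), "num_reads": float(row["NumReads"])}
        for row in reader
    }


# Toy quant.sf content with invented transcript names and values.
EXAMPLE_QUANT = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx_A\t1500\t1320.0\t12.5\t200.0\n"
    "tx_B\t900\t720.0\t0.0\t0.0\n"
)

quant_table = read_quant_sf(io.StringIO(EXAMPLE_QUANT))
expressed = [name for name, row in quant_table.items() if row["tpm"] > 0]
print(expressed)
```

With a real output file, the same parser would be called as `read_quant_sf(open("quant.sf"))`, where the path points to one sample's Salmon output directory.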
Off the HPC, you will explore your data outputs using a free graphical user interface, Galaxy
The quantification files were uploaded to Galaxy and analyzed with DESeq2. DESeq2 normalizes the counts to adjust for factors that prevent direct comparison between samples, such as differences in sequencing depth. The normalized output was then analyzed to identify differentially expressed genes.
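To make the normalization idea concrete, the sketch below computes per-sample size factors with the median-of-ratios approach that DESeq2 uses by default. It is a simplified stand-alone illustration, not DESeq2's own code or the Galaxy tool, and the counts are invented toy values.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts maps each sample name to a list of raw counts, with genes in the
    same order for every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean of each gene across samples; genes with a zero
    # count in any sample are skipped (marked None).
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    # A sample's size factor is the median ratio of its counts to those means.
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - lgm
            for g, lgm in enumerate(log_geo_means)
            if lgm is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Toy counts: sample "B" was sequenced twice as deeply as sample "A".
toy_counts = {"A": [10, 20, 30, 0], "B": [20, 40, 60, 5]}
factors = size_factors(toy_counts)
normalized = {s: [c / factors[s] for c in toy_counts[s]] for s in toy_counts}
print(factors)
```

After dividing by the size factors, the first three genes have the same normalized value in both samples, so the twofold difference in depth no longer looks like a difference in expression.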
I want to thank the CPT learning community, my teammates Monica Judd and Carlos Cofre, the instructors Dr. Carly Sjorgen, Dr. Emily Delorean, Dr. Emily Cartwright, and Edmaritz Hernandez Pagan. I also want to acknowledge the NC State HPC and bioinformatic resources that made this research possible.