Trimming reads using Trim Galore
I used Trim Galore (version 0.4.4) to trim adapters from the raw FASTQ sequence files. This program wraps Cutadapt and FastQC to remove adapter sequences introduced by the sequencing protocol. I used the paired option to trim adapter sequences from both reads of the paired-end files. I used the default adapter option, which trims the first 13 bp of the Illumina adapter, ‘AGATCGGAAGAGC’. I chose this option because when I ran FastQC on the raw sequences, I detected this adapter sequence in all of the reads. All other options were kept at their defaults. I used the fastqc option in the program to run FastQC on the trimmed files as a quality check.
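A minimal sketch of how this trimming step could be assembled for one paired-end sample, written in Python so the command can be inspected before it is run. The file names and the output directory are hypothetical placeholders, not the files used in this analysis. No adapter flag is passed, so Trim Galore falls back to its default adapter handling.

```python
import shlex


def trim_galore_command(read1, read2, output_dir="trimmed"):
    """Build the argument list for one paired-end Trim Galore run."""
    return [
        "trim_galore",
        "--paired",                  # paired-end mode: both files are trimmed together
        "--fastqc",                  # run FastQC on the trimmed output
        "--output_dir", output_dir,  # hypothetical output location
        read1,
        read2,
    ]


# Hypothetical file names, used only to show the resulting command line.
cmd = trim_galore_command("sample1_R1.fastq.gz", "sample1_R2.fastq.gz")
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` on a machine where Trim Galore is installed.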
Building Trinity de novo assembly
I built a de novo transcriptome assembly using Trinity (version 2.6.6). I used the default options except for the minimum contig length, which I set to 150 bp. This created the Trinity.fasta file, which I used for downstream analysis. I also used this file to build a Bowtie index of the assembly for running TopHat alignments.
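The assembly and indexing steps could be sketched as the two commands below. Only the minimum contig length of 150 bp comes from the description above. The CPU count, memory limit, file names, and index prefix are assumptions for illustration, and `bowtie2-build` is assumed because TopHat 2 uses Bowtie 2 by default (a Bowtie 1 index would be built with `bowtie-build` instead).

```python
def trinity_command(left, right, output_dir="trinity_out", cpu=12, max_memory="50G"):
    """Build the argument list for a paired-end Trinity assembly."""
    return [
        "Trinity",
        "--seqType", "fq",             # input reads are FASTQ
        "--left", left,                # read 1 file(s), comma-separated
        "--right", right,              # read 2 file(s), comma-separated
        "--min_contig_length", "150",  # value stated in the text
        "--CPU", str(cpu),             # assumed thread count
        "--max_memory", max_memory,    # assumed memory limit
        "--output", output_dir,        # Trinity expects "trinity" in this name
    ]


def bowtie2_index_command(fasta, index_prefix):
    """Build the argument list for indexing the assembly with Bowtie 2."""
    return ["bowtie2-build", fasta, index_prefix]


# Hypothetical file names, used only to show the resulting commands.
assemble = trinity_command("all_R1_val_1.fq.gz", "all_R2_val_2.fq.gz")
index = bowtie2_index_command("trinity_out/Trinity.fasta", "trinity_index")
print(" ".join(assemble))
print(" ".join(index))
```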
TopHat
I used TopHat (version 2.6.6) to align the trimmed reads. I ran the alignments twice: first against the Dovetail genome and second against the de novo transcriptome assembly I built using Trinity (above). For both alignments, I set the edit distance to 5, read mismatches to 3, segment mismatches to 3, read mismatches to 4, read gap length to 4, and align distance to 0. I used 12 threads, a maximum intron length of 1000, and a minimum intron length of 20. I describe the alignments below.
Using genome reference
Here I used the Dovetail genome as the reference. The concordant alignment percentages ranged from 42% to 52% across all samples.
Using Trinity de novo transcriptomic assembly
Here I used the de novo transcriptomic assembly as the reference. The concordant alignment percentages were higher, ranging from 86% to 90%.
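Concordant alignment percentages like these can be collected from the `align_summary.txt` file that TopHat writes for each run. The sketch below assumes the summary contains a line of the form "N% concordant pair alignment rate." The example text and sample names are made up to exercise the parser and are not results from this analysis.

```python
import re

# Matches the final line of the paired-end section of align_summary.txt.
CONCORDANT_RE = re.compile(r"([\d.]+)% concordant pair alignment rate")


def concordant_rate(summary_text):
    """Return the concordant pair alignment rate as a float, or None if absent."""
    match = CONCORDANT_RE.search(summary_text)
    return float(match.group(1)) if match else None


def summary_table(summaries):
    """Turn {sample: summary_text} into tab-separated table lines."""
    lines = ["sample\tconcordant_pct"]
    for sample in sorted(summaries):
        rate = concordant_rate(summaries[sample])
        value = "NA" if rate is None else format(rate, ".1f")
        lines.append(sample + "\t" + value)
    return lines


# Made-up summary text, for illustration only.
EXAMPLE_SUMMARY = """\
Aligned pairs:   900
     of these:    50 ( 5.6%) have multiple alignments
                  25 ( 2.8%) are discordant alignments
87.5% concordant pair alignment rate.
"""

table = summary_table({"sampleA": EXAMPLE_SUMMARY, "sampleB": "no paired section"})
print("\n".join(table))
```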
Here is the table listing the concordant alignment percentage for each sample (12 samples):
Cufflinks
I used Cufflinks (v2.2.1) to get the expression values and counts for transcripts in each sample. I again ran two sets of analyses, one for the genome alignments and one for the de novo assembly alignments. I used the transcripts.gtf output file for each sample to create a final file of FPKM values and coverage values.
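One way the per-sample GTF files could be combined into a single table is sketched below. It assumes each transcript record carries `transcript_id`, `FPKM`, and `cov` attributes in the ninth GTF column, and that transcript IDs are comparable across samples. The two example records and sample names are invented for illustration.

```python
import re

# Captures key "value" pairs from the GTF attribute column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def read_transcript_values(gtf_lines):
    """Return {transcript_id: (FPKM, coverage)} for the transcript records."""
    values = {}
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comments, short lines, and exon records.
        if line.startswith("#") or len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        values[attrs["transcript_id"]] = (float(attrs["FPKM"]), float(attrs["cov"]))
    return values


def combine_samples(per_sample_lines):
    """Merge {sample: gtf_lines} into {transcript_id: {sample: (FPKM, cov)}}."""
    combined = {}
    for sample, lines in per_sample_lines.items():
        for transcript_id, pair in read_transcript_values(lines).items():
            combined.setdefault(transcript_id, {})[sample] = pair
    return combined


# Invented records, for illustration only.
PREFIX = "contig1\tCufflinks\ttranscript\t1\t500\t1000\t+\t.\t"
LINE_A = PREFIX + 'gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "12.5"; cov "30.25";'
LINE_B = PREFIX + 'gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "8.0"; cov "19.5";'

combined = combine_samples({"sampleA": [LINE_A], "sampleB": [LINE_B]})
print(len(combined), "unique transcripts")
```

The number of keys in the combined dictionary is the count of unique transcript IDs across the samples.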
Using genome reference
This led to 51,428 unique identified transcripts.
Using Trinity de novo transcriptomic assembly
This led to 32,973 unique identified transcripts.