RNA-Seq Tuxedo tools workflow
Workflow using the Tuxedo tools via Galaxy
Below is a workflow to analyze RNA-Seq data (Illumina single-end) using the Tuxedo tools suite via Galaxy.
For an alternative method of quantification and differential expression, see our DESeq2 tutorial.
Preprocess the Reads
Follow the steps outlined in the DESeq2 tutorial.
Map the Reads
Follow the steps outlined in the DESeq2 tutorial.
Assemble Transcripts and Quantitate
8 - Assemble transcripts and estimate FPKM (fragments per kilobase of transcript per million mapped fragments, a normalized measure of expression level) for genes and transcripts.
NGS: RNA Analysis - Cufflinks
Perform Quantile Normalization: YES (improves estimates for transcripts with low abundance)
Perform Bias Correction: YES
Use Reference Annotation: Use Reference
Other commonly changed parameters:
Max intron length (the default is intended for mammals; a lower value is likely appropriate for other species)
Use Reference Annotation:
No - de novo assembly
Use Reference - Tells Cufflinks to use the supplied reference annotation (a GFF file) to estimate isoform expression. It will not assemble novel transcripts, and the program will ignore alignments not structurally compatible with any reference transcript.
Use Reference As Guide - Tells Cufflinks to use the supplied reference annotation (GFF) to guide reference annotation based transcript (RABT) assembly. Reference transcripts will be tiled with faux-reads to provide additional information during assembly. The output will include all reference transcripts as well as any novel genes and isoforms that are assembled.
The Cufflinks documentation describes the output files.
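As a rough illustration of what an FPKM value in step 8 represents, the sketch below applies the textbook definition to made-up numbers. It is not how Cufflinks computes its estimates: Cufflinks fits a statistical model and, with the settings above, also applies bias correction and quantile normalization, so its reported values will differ. The function name and the input numbers are hypothetical.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase of transcript per million mapped fragments.

    Simplified sketch only; Cufflinks uses a more elaborate model.
    For single-end data, one read counts as one fragment.
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("transcript length and library size must be positive")
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / kilobases / millions


# Hypothetical example: 500 fragments on a 2,000 bp transcript
# in a library of 20 million mapped fragments.
example = fpkm(500, 2_000, 20_000_000)
print(example)  # 12.5
```

Dividing by transcript length and by library size is what makes values comparable between long and short transcripts and between samples sequenced to different depths.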
9 - Use Cuffmerge to merge the transcripts assembled for each sample into a single set of transcript models.
This is not necessary if you only used reference annotations, since all samples will share the same transcripts.
NGS: RNA Analysis - Cuffmerge
10 - Use Cuffcompare to compare the assembled (and merged) transcripts to the reference annotation.
This is not necessary if you only used reference annotations, since all samples will share the same transcripts. However, depending on your reference GTF annotation file, Cuffcompare also creates the p_id (protein ID) and tss_id (transcription start site ID) attributes, which Cuffdiff can use later.
NGS: RNA Analysis - Cuffcompare
Use Reference: Yes
The Cuffcompare documentation describes the output files.
Repeat the quantitation steps for other samples.
Find Differentially Expressed Genes/Transcripts
11 - Use Cuffdiff to determine differentially expressed genes and transcripts
NGS: RNA Analysis - Cuffdiff
Select two TopHat accepted hits BAM files to compare
Perform quantile normalization: YES
Perform Bias Correction: YES
Perform Replicate Analysis: YES
Create a group for each treatment type and give appropriate names (Control, Treatment1, Treatment2)
Add one replicate for each group
Our example does not use replicates, but you must use them for any results you intend to publish.
Other commonly changed parameters:
False discovery rate: threshold that determines which tests get flagged as "significant"
Min Alignment Count: threshold that determines how many reads a transcript needs in order to be given an "OK" status instead of "NOTEST"
Perform Replicate Analysis: our example does not use replicates, but you must use them for any results you intend to publish
You can create named groups for each condition and add replicates to each group.
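To make the role of the false discovery rate threshold more concrete, here is a minimal sketch of a Benjamini-Hochberg correction followed by a threshold test. It assumes that the "significant" flag is set by comparing a Benjamini-Hochberg-adjusted p-value (q-value) against the chosen threshold; check the Cuffdiff documentation for the exact rule. The function name and the p-values are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    n = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so q-values never decrease with rank.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical p-values for four genes.
p = [0.001, 0.02, 0.03, 0.5]
q = benjamini_hochberg(p)

fdr = 0.05  # the "False discovery rate" parameter (assumed value)
significant = ["yes" if value <= fdr else "no" for value in q]
print(significant)  # ['yes', 'yes', 'yes', 'no']
```

Lowering the threshold flags fewer genes; raising it flags more, at the cost of a larger expected share of false positives among them.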
Cuffdiff documentation describes the many output datasets.
We are primarily interested in:
gene/transcript FPKM tracking (tabular)
Information about the gene/transcript (length, nearest_ref_id=NM_#####, TSS, etc.) and the FPKM confidence intervals for each condition.
gene/transcript differential expression testing (tabular)
The expression change between conditions, a status indicating whether there was enough data for that value to be accurate (OK is good; FAIL and NOTEST are bad; LOWDATA is somewhere in between), and a p-value.
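The expression change in the differential expression table is reported as a log2 fold change. The sketch below shows how such a value relates to the two per-condition FPKM values, assuming the convention log2(condition 2 / condition 1). The function name, the handling of zero values, and the numbers are assumptions made for illustration, not Cuffdiff's own code.

```python
import math


def log2_fold_change(fpkm_1, fpkm_2):
    """log2 of fpkm_2 / fpkm_1, with explicit handling of zero expression.

    The zero handling is a choice made for this sketch.
    """
    if fpkm_1 == 0 and fpkm_2 == 0:
        return 0.0  # no expression in either condition, so no change
    if fpkm_1 == 0:
        return math.inf  # expressed only in condition 2
    if fpkm_2 == 0:
        return -math.inf  # expressed only in condition 1
    return math.log2(fpkm_2 / fpkm_1)


# Hypothetical FPKM values: 10 in Control, 40 in Treatment1.
print(log2_fold_change(10.0, 40.0))  # 2.0, a four-fold increase
print(log2_fold_change(40.0, 10.0))  # -2.0, a four-fold decrease
```

A value of 1 means expression doubled and -1 means it halved, which is why sorting on this column puts the most strongly changed genes at the ends of the table.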
12 - Use various text manipulation tools to extract the data you are interested in
Galaxy offers a wide array of text manipulation tools that can be used to select the data you want.
Much of this could also be done in Excel, but once you figure out how to do it in Galaxy, you can save it in a workflow and re-use it later.
Examples:
- Select genes with OK or LOWDATA status from gene differential expression testing
- Filter and Sort -> Filter
- Condition: c7=='OK' or c7=='LOWDATA' or c7=='status' (the last term keeps the header row)
- Select genes with significant differential expression
- Filter and Sort -> Filter
- Condition: c14 == 'yes' or c14 == 'significant' (the last term keeps the header row)
- Sort by log2 fold change
- Filter and Sort -> Sort
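For readers who want to check the same logic outside Galaxy, the three examples above can be sketched in Python. The sketch assumes a tab-separated table with a header row in which column 7 is named status, column 10 log2(fold_change), and column 14 significant; the column names, the select_genes helper, and all rows are made up for illustration, so adjust them to match your actual dataset.

```python
import csv
import io

# Assumed column layout of the gene differential expression testing table.
HEADER = ["test_id", "gene_id", "gene", "locus", "sample_1", "sample_2",
          "status", "value_1", "value_2", "log2(fold_change)", "test_stat",
          "p_value", "q_value", "significant"]

# Made-up rows, for illustration only.
ROWS = [
    ["G1", "G1", "geneA", "chr1:1-100", "Control", "Treatment1",
     "OK", "10", "40", "2", "3.1", "0.001", "0.004", "yes"],
    ["G2", "G2", "geneB", "chr1:200-300", "Control", "Treatment1",
     "NOTEST", "0.1", "0.2", "1", "0", "1", "1", "no"],
    ["G3", "G3", "geneC", "chr2:1-100", "Control", "Treatment1",
     "OK", "40", "10", "-2", "-3.0", "0.002", "0.004", "yes"],
    ["G4", "G4", "geneD", "chr2:200-300", "Control", "Treatment1",
     "LOWDATA", "5", "6", "0.263", "0.2", "0.7", "0.8", "no"],
]
TABLE = "\n".join("\t".join(row) for row in [HEADER] + ROWS)


def select_genes(handle):
    """Keep OK/LOWDATA rows flagged significant, sorted by log2 fold change."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = [
        row for row in reader
        if row["status"] in ("OK", "LOWDATA") and row["significant"] == "yes"
    ]
    # float() also parses "inf" and "-inf", should the table contain them.
    kept.sort(key=lambda row: float(row["log2(fold_change)"]))
    return kept


selected = select_genes(io.StringIO(TABLE))
for row in selected:
    print(row["gene"], row["log2(fold_change)"])
```

Because csv.DictReader consumes the header row itself, the extra condition used in Galaxy to keep the header is not needed here. To run this on a downloaded dataset, pass an open file handle to select_genes instead of the in-memory table.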