RNA-Seq Tuxedo tools workflow

Workflow using the Tuxedo tools via Galaxy

Below is a workflow to analyze RNA-Seq data (Illumina single-end) using the Tuxedo tools suite via Galaxy.

For an alternative method of quantification and differential expression, see our DESeq2 tutorial.

Preprocess the Reads

Follow the steps outlined in the DESeq2 tutorial.

Map the Reads

Follow the steps outlined in the DESeq2 tutorial.

Assemble Transcripts and Quantitate

8 - Assemble transcripts and estimate FPKM (fragments per kilobase of transcript per million mapped fragments, a normalized measure of expression level) for genes and transcripts.

NGS: RNA Analysis - Cufflinks

Perform Quantile Normalization: YES (improves estimates for transcripts with low abundance)

Perform Bias Correction: YES

Use Reference Annotation: Use Reference

Other commonly changed parameters:

Max intron length (the default is suited to mammals; a lower value is likely appropriate for other species)

Use Reference Annotation:

No - de novo assembly

Use Reference - Tells Cufflinks to use the supplied reference annotation (a GFF file) to estimate isoform expression. It will not assemble novel transcripts, and the program will ignore alignments not structurally compatible with any reference transcript.

Use Reference As Guide - Tells Cufflinks to use the supplied reference annotation (GFF) to guide RABT assembly. Reference transcripts will be tiled with faux-reads to provide additional information in assembly. Output will include all reference transcripts as well as any novel genes and isoforms that are assembled.

Cufflinks documentation describes the output files
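As a rough illustration of what the FPKM values from step 8 mean, here is a minimal Python sketch of the textbook FPKM definition. This is not how Cufflinks computes its estimates internally (Cufflinks uses a statistical model that also handles multi-mapped reads and bias correction); the function name and the example numbers are hypothetical.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Naive FPKM: fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("transcript length and library size must be positive")
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragment_count / (kilobases * millions)


# Hypothetical example: 500 fragments on a 2,000 bp transcript
# in a library of 10 million mapped fragments.
print(fpkm(500, 2_000, 10_000_000))  # 25.0
```

Normalizing by both transcript length and library size is what makes FPKM values comparable between transcripts and between samples.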

9 - Use Cuffmerge to merge the assembled transcripts from all samples into a single set of transcript models

This is not necessary if you only used reference annotations since all samples will share the same transcripts.

NGS: RNA Analysis - Cuffmerge

10 - Use Cuffcompare to compare the assembled (and merged) transcripts to the reference annotation.

This is not necessary if you only used reference annotations, since all samples will share the same transcripts. However, depending on your reference GTF annotation file, Cuffcompare may still be useful: it also creates the p_id (protein ID) and tss_id (transcription start site ID) attributes, which Cuffdiff can use later.

NGS: RNA Analysis - Cuffcompare

Use Reference: Yes

Cuffcompare documentation describes the output files

Repeat the quantitation steps for other samples.

Find Differentially Expressed Genes/Transcripts

11 - Use Cuffdiff to determine differentially expressed genes and transcripts

NGS: RNA Analysis - Cuffdiff

Select two Tophat accepted hits BAM files to compare

Perform quantile normalization: YES

Perform Bias Correction: YES

Perform Replicate Analysis: YES

Create a group for each treatment type and give appropriate names (Control, Treatment1, Treatment2)

Add one replicate for each group

Our example does not use replicates, but you must use them for any results you intend to publish.

Other commonly changed parameters:

False discovery rate: threshold for determining what gets flagged as "significant"

Min Alignment Count: threshold that determines how many reads a transcript needs in order to be given an "OK" status instead of "NOTEST"

Perform Replicate Analysis: our example does not use replicates, but you must use them for any results you intend to publish.

You can create named groups for each condition and add replicates to each group.
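To make the false discovery rate threshold more concrete, the following Python sketch applies a Benjamini-Hochberg correction to a list of p-values and flags those whose adjusted value falls at or below the threshold. It is only an illustration of the general idea behind the q-value and "significant" columns; the function names, the example p-values, and the exact comparison used are assumptions, not Cuffdiff's own code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


def flag_significant(p_values, fdr=0.05):
    """Label each test 'yes' or 'no' at the chosen false discovery rate."""
    return ["yes" if q <= fdr else "no" for q in benjamini_hochberg(p_values)]


# Hypothetical p-values for four genes.
print(flag_significant([0.01, 0.04, 0.03, 0.5]))  # ['yes', 'no', 'no', 'no']
```

Note that a raw p-value below 0.05 (such as 0.04 or 0.03 here) is not necessarily flagged once the multiple-testing correction is applied.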

Cuffdiff documentation describes the many output datasets.

We are primarily interested in:

gene/transcript FPKM tracking (tabular)

Information about the gene/transcript (length, nearest_ref_id=NM_#####, TSS, etc.) and the FPKM confidence intervals for each condition.

gene/transcript differential expression testing (tabular)

Expression change between conditions, a status indicating whether there was enough data for that value to be reliable (OK is good; FAIL and NOTEST are bad; LOWDATA is somewhere in between), and a p-value.

12 - Use various text manipulation tools to extract the data you are interested in

Galaxy offers a wide array of text manipulation tools that can be used to select the data you want.

Much of this could also be done in Excel, but once you figure out how to do it in Galaxy, you can save it in a workflow and re-use it later.

Examples:

  • Select genes with OK or LOWDATA status from gene differential expression testing
      • Filter and Sort -> Filter
        • Condition: c7=='OK' or c7=='LOWDATA' or c7=='status' (the last term keeps the header line)
  • Select genes with significant differential expression
      • Filter and Sort -> Filter
        • Condition: c14 == 'yes' or c14 == 'significant' (the last term keeps the header line)
  • Sort by log2 fold change
      • Filter and Sort -> Sort
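For readers who prefer to work outside Galaxy, the three examples above can be chained in a short Python sketch. It assumes a tab-delimited Cuffdiff differential expression file with header columns named status, significant and log2(fold_change); the function name and the small in-memory table are hypothetical and stand in for a real downloaded dataset.

```python
import csv
import io


def select_significant(lines, keep_status=("OK", "LOWDATA")):
    """Keep significant rows with an acceptable status, sorted by log2 fold change."""
    reader = csv.DictReader(lines, delimiter="\t")
    kept = [
        row for row in reader
        if row["status"] in keep_status and row["significant"] == "yes"
    ]
    # Sort numerically; float() also parses the "inf" and "-inf" strings.
    kept.sort(key=lambda row: float(row["log2(fold_change)"]))
    return kept


# Hypothetical miniature table with only the columns used here.
example = io.StringIO(
    "test_id\tgene\tstatus\tlog2(fold_change)\tsignificant\n"
    "g1\tGeneA\tOK\t2.5\tyes\n"
    "g2\tGeneB\tNOTEST\t0\tno\n"
    "g3\tGeneC\tOK\t-1.2\tyes\n"
    "g4\tGeneD\tOK\t0.1\tno\n"
)
for row in select_significant(example):
    print(row["gene"], row["log2(fold_change)"])
```

Because the rows are read by column name rather than by position, the header line needs no special handling, unlike the c7=='status' term in the Galaxy filter condition. To use a real file, pass an open file handle instead of the in-memory table.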