RNA-Seq

RNA-Seq analysis workflow using Galaxy and DESeq2

Below is a simple workflow to analyze RNA-Seq data (Illumina single-end) using Galaxy.

Setup the Data and Workflow

  1. Import raw fastq reads into Galaxy as a Collection
    • Get Data -> Upload File
      • If uploading data from another online location (e.g. Princeton's HTSEQ database), it is useful to use the "Rule Based Uploader" tab, and paste a table of sample names and URLs
      • If upload from your computer, use the "Collection" tab. Once you've uploaded the files, click the "Build" button to assign the appropriate sample name to each file and create a collection.
  2. Import a copy of the RNA-Seq workflow
  3. Run FastQC on fastq file(s) for each sample/replicate. Look for low quality, bias in sequence, etc.
  • NGS: QC and manipulation -> Fastqc: Fastq QC
    • Use Per Base Sequence Content and Kmer Content to determine if adapter trimming is necessary

3 - Optional - Trim poor quality bases and/or adapter sequences from reads

  • NGS: QC and manipulation -> Cutadapt
  • Qualities are in Phred Scale, usually 20 or 30 are good thresholds

Map the Reads

4 - Upload or import a GTF file for the genome you are working with.

  • An excellent source of GTF files for many organisms is Illumina's iGenomes. These GTF files are augmented with the tss_id and p_id GTF attributes that Cufflinks needs to perform differential splicing, CDS output, and promoter user analysis.
  • See below for additional sources of GTF files.
  • Princeton users: Many of these files are already available in the Princeton Galaxy server's Data Libraries. Look in Shared Data -> Data Libraries -> Reference Genomes and Annotations for your genome.

5 - Map reads to genome using TopHat. TopHat accounts for reads that cross splice junction boundaries (see TopHat documentation).

  • NGS: RNA Analysis -> Tophat2
    • Select the appropriate reference genome
    • TopHat Settings: Full Parameter List
      • Using Default Parameters may be OK, but may not give meaningful results
    • Library Type: FR First Strand
      • This is the standard strand specific Illumina protocol
    • For better consistency, it is often best to provide TopHat with known gene annotations and avoid novel gene detection
      • Use Own Junctions: Yes
      • Use Gene Annotation Model: Yes
      • Gene Model Annotations: Select GTF file from step 4 above
      • Only look for supplied junctions: Yes
    • The Tophat documentation describes the output files

Repeat the mapping steps for other samples.

Visualize

6 - Visualize the data in Galaxy Trackster

  • Visualization -> New Track Browser
    • Create the visualization using the appropriate reference build and add datasets to your visualization by clicking on the Add Datasets to Visualization button.
    • Add the 'accepted hits' BAM dataset and the 'splice junctions' dataset produced by Tophat2 and also the GTF file of genes.

7 - Check 5'-3' bias

Reads may not be aligned uniformly across the entire gene body, due to biases associated to the different steps of RNA-seq protocols (fragmentation, RNA degradation, secondary structures, etc.). Uniform coverage of genes is best for quantitation, however, it is also important that each sample have comparable 5'-3' bias. This step will help check the bias in each sample.

  • Upload or import a BED file (12-column format) for the genome you are working with.
    • Get Data -> UCSC Main Table Browser
      • Select the appropriate Clade, Genome, and Assembly
      • Select Group: "Genes and Gene Predictions"
      • Select your desired track and table
      • Select output format: BED
      • Click the "get output" button
      • Select "Whole Gene"
      • Click the "send query to Galaxy" button
  • NGS: RNA-Seq QC -> Gene Body Coverage (BAM)
    • Use the BED file you imported above as the "Reference Genome"
    • Use the "Tophat2 Accepted Hits" file as the input .bam file
      • If you have multiple samples, you can copy the "TopHat2 Accepted Hits" files to a single history and use them all as input to this step to create a single graph.

Quantitation

8 - Use htseq-count to count the reads that align to each exon and sum them up for each gene

  • NGS: RNA Analysis -> htseq-count
    • GFF File: Select the GTF file used for TopHat
    • Mode: Intersection (nonempty)
      • Intersection (nonempty) will typically provide high read counts, while Intersection (strict) is more conservative. See the htseq-count documentation for a complete explanation.
    • Feature type: exon
      • Feature type (3rd column in GFF file) to be used. All features of other types are ignored. The default, suitable for RNA-Seq and Ensembl GTF files, is exon.
    • ID Attribute: gene_id
      • GFF attribute to be used as feature ID. Several GFF lines with the same feature ID will be considered as parts of the same feature. The feature ID is used to identity the counts in the output table. All features of the specified type MUST have a value for this attribute. The default, suitable for RNA-SEq and Ensembl GTF files, is gene_id.

Repeat the quantitation steps for other samples.

Find Differentially Expressed Genes/Transcripts

9 - Download the gene counts files to be used with DESeq2 (or other differential expression software).

  • DESeq2 is not yet part of Galaxy (as of March 2014) and thus you must download the count files and import them into R on your own computer.
  • At this point, the files are relatively small and most any computer will be capable of handling the data and computation.

10 - Setup the sample table

directory <- "directory/with/htseq-count/files"
sampleFiles <- grep("treated",list.files(directory),value=TRUE)
sampleCondition <- sub("(.*treated).*","\\1",sampleFiles)
sampleTable <- data.frame(sampleName = sampleFiles,
    fileName = sampleFiles,
    condition = sampleCondition)

11 - Load the data into R

dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
    directory = directory, design= ~ condition)
colData(dds)$condition <- factor(colData(dds)$condition,
    levels=c("untreated","treated"))
  • Note: It is important to supply levels (otherwise the levels are chosen in alphabetical order) and to put the “control” or “untreated” level as the first element, so that the log2 fold changes and results will be most easily interpretable.

12 - Normalize samples

dds <- estimateSizeFactors(dds)

13 - Sample QC: Transform data to log space and visualize samples

rld <- rlogTransformation(dds, blind = TRUE)
print(plotPCA(rld, intgroup = c("condition")))

14 - Estimate biological variance and visualize

dds <- estimateDispersions(dds)
plotDispEsts(dds)

15 - Determine differential expression

dds <- nbinomWaldTest(dds)

16 - Export results to file

# Get results and write to output files
res <- results(dds)
res <- res[order(res$padj), ]
write.table(cbind('#ID'=rownames(as.data.frame(res)), as.data.frame(res)),
    file='deseq2_results.txt', quote=FALSE, sep='\t', row.names=FALSE) 
# Display the top few genes
head(res)

17 - Visualize with MA plot

plotMA(dds,ylim=c(-2,2),main="DESeq2")

Things to Consider

Mapping

Splice Junctions

Mapping reads across splice junctions is one of the main challenges in RNA-Seq. There are a few approaches:

  1. Exon junction libraries - a reference sequence is constructed using annotated exon boundaries
    • Requires less coverage, but doesn't detect novel junctions. This approach can also lead to better reproducibility between samples. Suitable when the reference transcriptome is well known.
    • Using Tophat/Cufflinks with a reference GTF uses this approach.
  2. De novo splice junction detection - the reads are used to determine splice junctions by mapping parts of reads independently to the reference
    • Requires greater coverage to ensure correct detection. Also slower due to increased computational demands.
    • This is the default method for TopHat and Cufflinks.
  3. De novo transcriptome assembly - No reference is used, but the reads are completely assembled
    • Requires high coverage and high quality data. Most approaches require a large amount of memory.
    • Trinity is an example of software that takes this approach.

When using either of the de novo approaches, there are few additional considerations:

  1. Results may be inconsistent between samples.
  2. Since the search is done separately for each sample, a particular gene/junction may be found in one sample, but not in another simply due to differences in expression/coverage, etc. In order to compensate for this, it may be best to create a "master" gene model by combining the results of the novel gene search from multiple samples and then rerunning the alignments providing this master gene model as input and turning off the novel search. This obviously adds a bit of complexity and time to the analysis, but may be worthwhile. There are a couple of ways you could do this.
    1. Combining all of the data together for the initial "search phase"
    2. Combining the results of multiple searches (e.g. using cuffmerge)
  3. Many "genes" will be unannotated.
  4. The results you get for expression/etc. will often be for gene_#### which has no annotations other than location in the genome. While this obviously allows for a lot of discovery, interpretation can be difficult.

Unique vs. Multi-Mapped Reads

When mapping reads to the genome or transcriptome, many reads map best to one particular location. However, some map equally well to one or more places, usually due to repetitive regions in the genome (multiple copies of genes, etc.). Most mapping algorithm allow to control what happens with these "multi-mappers". By default, TopHat will map a read in up to 20 places in the genome. Reads that map to more locations are discarded. When calculating coverage, the count is divided by the number of places a read mapped to. For example, a read that maps to two places will contribute 0.5 read to the count for each position.

Cufflinks can also attempt to improve the estimation using the 'rescue' method described in Ali Mortazavi, Brian A Williams, Kenneth McCue, Lorian Schaeffer and Barbara Wold, Mapping and quantifying mammalian transcriptomes by RNA-Seq, Nature Methods, Volume 5, 621 - 628 (2008).

In a BAM file created by TopHat there is an auxiliary tag (NH) that specifies the number of hits each read has. For example, NH:i:2 says that there are two hits for that read. Other mapping programs may encode this information differently.

Samtools flagstat and FastQC read count - Both samtools flagstat and FastQC will report one "read" for each mapping and NOT each unique read. This means that you cannot simply use those tools to determine how many reads were mapped. To do that something like the following Linux command will work (though it's likely rather slow):

samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l

Normalization

There are various methods of normalizing counts from RNA-Seq data. Sequencing depth, gene length, and count distribution are the main biases that must be accounted for in the normalization and/or differential expression calculations.

  • RPKM (Mortazavi et al., 2008) - Counts divided by transcript length (kb) times the total number of millions of mapped reads.
          number or reads in the region
          -----------------------------
RPKM =    total reads       region length
          -----------   X   -------------
            1000000             1000
  • FPKM (Trapnell et al., 2010) - Fragments per Kilobase of exon per Million mapped reads, analogous to RPKM.
  • Upper-quartile (Bullard et al., 2010) - Counts are divided by upper-quartile of counts for transcripts with at least one read.
  • TMM (Robinson and Oshlack, 2010) - Trimmed mean of M values.
  • Quantiles, as in microarray normalization (Irizarry et al., 2003).

BEDTools and htseq-count can be used to get raw read counts.

Differential Expression

Parametric - Model counts using known probability distributions such as Poisson, Binomial, Negative Binomial, etc.

edgeR (Robinson et al. 2010) - Exact test based on negative binomial

DESeq (Anders and Huber, 2010) - Exact test based on negative binomial

DEGseq (Wang et al., 2010) - MA-plot based methods assuming normal distribution for M/A

baySeq (Hardcastle et al. 2010) - Estimation of posterior likelihood of differential expression via empirical Baysian methods using Poisson or NB distribution.

Non-Parametric - No assumptions about data distribution are made

Fisher's exact test (use normalized counts)

Cuffdiff (Trapnell et al., 2010) - Based on entropy divergence for relative transcript abundances. Divergence is a measurement of the "distance" between the relative abundances of transcripts between two conditions.

NOISeq - Compare signal and noise distributions to determine differentially expressed genes

Handling Replicates - Biological replicates are generally considered to be required in RNA-Seq

You can also determine normalized quantities for each sample and generate scatterplots to show correlation between each sample and produce a correlation matrix to determine the extent of correlation between replicates.

Most differential expression packages handle replicates explicitly.

Visualization

  • CummeRbund - Advanced visualization of Cufflinks and Cuffdiff output using R
  • IGV - Excellent Java based genome browser for viewing BAM files from the Broad Institute
  • IGB - Another Java based genome browser that excels at viewing annotation data has can also view BAM files
  • UCSC Genome Browser - Online genome browser that provides access to many types of annotation and existing data and allows you upload your own (even view directly from galaxy).
  • Organism specific genome browsers - Many organisms provide users with custom genome browsers with the latest annotation and experimental data available online. Most also allow users to upload their own data.
    • Wormbase - Nematode Information Resource
    • PlasmoDB - Plasmodium Genomics Resource
    • SGD - Saccharomyces Genome Database

Annotation Resources

Finding gene annotations in well-formed GTF files can be a challenge, espcially if you wish to perform differential splicing analysis. You can find some explanation of the issue by Cole Trapnell in this SeqAnswers post. Below are some helpful resources.

  • Illumina's iGenomes is an excellent source of such annotation data for many organisms. Illumina has provided the RNA-Seq user community with a set of genome sequence indexes (including Bowtie indexes) as well as GTF transcript annotation files. These files can be used with TopHat and Cufflinks to quickly perform expression analysis and gene discovery. The annotation files are augmented with the tss_id and p_id GTF attributes that Cufflinks needs to perform differential splicing, CDS output, and promoter user analysis. More information about Illumina's iGenomes project.
    • Many of the iGenomes files are already available in the Princeton Galaxy server's Data Libraries. Look in Shared Data -> Data Libraries -> Reference Genomes and Annotations for your genome.
  • ENSEMBL provides both FASTA and GTF files for many organisms on their FTP site.
  • UCSC is also provides GTF files, however, annotations in these files have a major limitation. The gene_id and transcript_id are identical, making it impossible to collect transcripts from a single gene. Using the UCSC Table Browser with the settings similar to below will provide a valid, but limited GTF file.
      • clade: Mammal
      • genome: Human
      • assembly: Feb. 2009 (GRCh37/hg19)
      • group: Genes and Gene Prediction Tracks
      • track: RefSeq Genes
      • table: refFlat
      • output format: GTF - gene transfer format
  • UCSC GenePred to GTF conversion. A better way to get the UCSC gene annotations into GTF format is to use the genePredToGtf conversion program.
  • Convert from GenBank. When all else fails, you can use BioPerl scripts (genbank2gff.pl or genbank2gff3.pl) to convert a GenBank record into a GFF file. From there, you might be able to convert it to a GTF file, though it will take at least some scripting and likely some hand tweaking to get it all right.

Other Resources

Overviews

Tutorials

Tuxedo Tools (Bowtie, Tophat, Cufflinks, etc.)

Other Tools

  • MISO - Probabilistic analysis and design of RNA-Seq experiments for identifying isoform regulation
    • This software is installed on the cetus cluster (access restricted to Princeton researchers).
  • edgeR (Robinson et al. 2010)
  • DESeq (Anders and Huber, 2010)
  • DEGseq (Wang et al., 2010)
  • baySeq (Hardcastle et al. 2010)
  • NOISeq - Estimation and removal of RNA-Seq noise, non-parametric differential expression
  • Trinity - de novo transcriptome assembly
  • HTSeq - Python package and scripts to work with high throughput sequencing data including fastq, gff/gtf, and vcf formatted files