RNA-Seq analysis workflow using Galaxy and DESeq2 Below is a simple workflow to analyze RNA-Seq data (Illumina single-end) using Galaxy and the DESeq2 package from Bioconductor.
For an alternative method of quantification and differential expression, see our Tuxedo tools tutorial.
1 - Import data into Galaxy
2 - Run FastQC on fastq file(s) for each sample/replicate. Look for low quality, bias in sequence, etc.
3 - Optional - Trim poor quality bases and/or adapter sequences from reads
4 - Upload or import a GTF file for the genome you are working with.
5 - Map reads to genome using TopHat. TopHat accounts for reads that cross splice junction boundaries (see TopHat documentation).
Repeat the mapping steps for other samples.
6 - Visualize the data in Galaxy Trackster
7 - Check 5'-3' bias
Reads may not be aligned uniformly across the entire gene body, due to biases associated to the different steps of RNA-seq protocols (fragmentation, RNA degradation, secondary structures, etc.). Uniform coverage of genes is best for quantitation, however, it is also important that each sample have comparable 5'-3' bias. This step will help check the bias in each sample.
8 - Use htseq-count to count the reads that align to each exon and sum them up for each gene
Repeat the quantitation steps for other samples.
9 - Download the gene counts files to be used with DESeq2 (or other differential expression software).
10 - Setup the sample table
directory <- "directory/with/htseq-count/files"
sampleFiles <- grep("treated",list.files(directory),value=TRUE)
sampleCondition <- sub("(.*treated).*","\\1",sampleFiles)
sampleTable <- data.frame(sampleName = sampleFiles,
fileName = sampleFiles,
condition = sampleCondition)
11 - Load the data into R
dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
directory = directory, design= ~ condition)
colData(dds)$condition <- factor(colData(dds)$condition,
levels=c("untreated","treated"))
12 - Normalize samples
dds <- estimateSizeFactors(dds)
13 - Sample QC: Transform data to log space and visualize samples
rld <- rlogTransformation(dds, blind = TRUE)
print(plotPCA(rld, intgroup = c("condition")))
14 - Estimate biological variance and visualize
dds <- estimateDispersions(dds)
plotDispEsts(dds)
15 - Determine differential expression
dds <- nbinomWaldTest(dds)
16 - Export results to file
# Get results and write to output files
res <- results(dds)
res <- res[order(res$padj), ]
write.table(cbind('#ID'=rownames(as.data.frame(res)), as.data.frame(res)),
file='deseq2_results.txt', quote=FALSE, sep='\t', row.names=FALSE)
# Display the top few genes
head(res)
17 - Visualize with MA plot
plotMA(dds,ylim=c(-2,2),main="DESeq2")
Splice Junctions
Mapping reads across splice junctions is one of the main challenges in RNA-Seq. There are a few approaches:
When using either of the de novo approaches, there are few additional considerations:
Unique vs. Multi-Mapped Reads
When mapping reads to the genome or transcriptome, many reads map best to one particular location. However, some map equally well to one or more places, usually due to repetitive regions in the genome (multiple copies of genes, etc.). Most mapping algorithm allow to control what happens with these "multi-mappers". By default, TopHat will map a read in up to 20 places in the genome. Reads that map to more locations are discarded. When calculating coverage, the count is divided by the number of places a read mapped to. For example, a read that maps to two places will contribute 0.5 read to the count for each position.
Cufflinks can also attempt to improve the estimation using the 'rescue' method described in Ali Mortazavi, Brian A Williams, Kenneth McCue, Lorian Schaeffer and Barbara Wold, Mapping and quantifying mammalian transcriptomes by RNA-Seq, Nature Methods, Volume 5, 621 - 628 (2008).
In a BAM file created by TopHat there is an auxiliary tag (NH) that specifies the number of hits each read has. For example, NH:i:2 says that there are two hits for that read. Other mapping programs may encode this information differently.
Samtools flagstat and FastQC read count - Both samtools flagstat and FastQC will report one "read" for each mapping and NOT each unique read. This means that you cannot simply use those tools to determine how many reads were mapped. To do that something like the following Linux command will work (though it's likely rather slow):
samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l
There are various methods of normalizing counts from RNA-Seq data. Sequencing depth, gene length, and count distribution are the main biases that must be accounted for in the normalization and/or differential expression calculations.
number or reads in the region
-----------------------------
RPKM = total reads region length
----------- X -------------
1000000 1000
BEDTools and htseq-count can be used to get raw read counts.
Parametric - Model counts using known probability distributions such as Poisson, Binomial, Negative Binomial, etc.
edgeR (Robinson et al. 2010) - Exact test based on negative binomial
DESeq (Anders and Huber, 2010) - Exact test based on negative binomial
DEGseq (Wang et al., 2010) - MA-plot based methods assuming normal distribution for M/A
baySeq (Hardcastle et al. 2010) - Estimation of posterior likelihood of differential expression via empirical Baysian methods using Poisson or NB distribution.
Non-Parametric - No assumptions about data distribution are made
Fisher's exact test (use normalized counts)
Cuffdiff (Trapnell et al., 2010) - Based on entropy divergence for relative transcript abundances. Divergence is a measurement of the "distance" between the relative abundances of transcripts between two conditions.
NOISeq - Compare signal and noise distributions to determine differentially expressed genes
Handling Replicates - Biological replicates are generally considered to be required in RNA-Seq
You can also determine normalized quantities for each sample and generate scatterplots to show correlation between each sample and produce a correlation matrix to determine the extent of correlation between replicates.
Most differential expression packages handle replicates explicitly.
Finding gene annotations in well-formed GTF files can be a challenge, espcially if you wish to perform differential splicing analysis. You can find some explanation of the issue by Cole Trapnell in this SeqAnswers post. Below are some helpful resources.
Overviews
Tutorials
Tuxedo Tools (Bowtie, Tophat, Cufflinks, etc.)
Other Tools