Tophat, Cufflinks and SamTools

Let's start to work first. While program running, you can read more here.

In this tutorial, I will analyze mouse paired-end RNA-Seq data from Illumna GAII to calculate expression level of known AceView genes and alternative splicing variants, and identify novel gene and novel alternative splicing variants. If you you need to analyze RNA-Seq data from other species, please let me know. I will prepare the genome reference and gene annotation file according to your request.

Note: You can copy and paste all the text in blue to your Linux command line to run. Anything with "#" is comment, and will be IGNORED by Linux.

#1) ssh to smp node

# Because some scripts need large memory, also some of them can run multiple threads to speed up the analysis, I prefer to use the smp nodes.
# Please you should  not use the head nodes to run these analysis.
ssh ssh.ccv.brown.edu
ssh -Y smp007
# Note: option -Y will allow linux to forward graphic output from R, or Java

# Check if the node is busy. If yes, try other smp nodes. The available nodes are from "smp001" to "smp007"
bash
top

#2) run tophat to map reads onto genome and to identify genes and alternative splicing variants

# Create a folder to work within. Notice that files in the scratch folder more than 4 weeks old will be automatically deleted.
mkdir
scratch/cufflinks-test &&  cd scratch/cufflinks-test

# Here, I use lane 1 as example
export LANE=s_1

# Copy the raw illumina sequence file to the current folder
cp path_to_your_read_files/
${LANE}_*_sequence.txt .

# Here I use GERALD ouput file in _sequence.txt (fastq) format as my read files. If you have raw qseq.txt format from Bustard, the following command can do the conversion. For lane 1:
# read_files=/xyz/xyz/Bustard07-10-2010_xyz/s_1_1_*_qseq.txt 
# cat $read_files | perl /gpfs/runtime/bioinfo/bin/qseq2fastq.pl > s_1_1_sequence.txt

# read_files=/xyz/xyz/Bustard07-10-2010_xyz/s_1_2_*_qseq.txt 
# cat $read_files | perl /gpfs/runtime/bioinfo/bin/qseq2fastq.pl > s_1_2_sequence.txt

# Run FastQC to check sequence quality (this step is optional, see this link for detail on how to use FastQC: https://sites.google.com/a/brown.edu/bioinformatics-in-biomed/fastqc)
module load fastqc
fastqc

# Setup bowtie executable and index path for Tophat
export PATH=/gpfs/runtime/bioinfo/bowtie:/gpfs/runtime/bioinfo/samtools-0.1.8/misc:
/gpfs/runtime/bioinfo/bin:$PATH
export BOWTIE_INDEXES=/gpfs/runtime/bioinfo/bowtie/indexes/

# Setup email address to get a email notice when each step is done. You need use your own email address in stead of mike_wood@brown.edu
export MY_EMAIL=mike_wood@brown.edu

# Check to make sure the email notice function works. After the command finishes for a few seconds, you should get an email notice telling you the command is finished.
notice_me


# Run tophat. This will need a few hours.
# Here I use 5 threads
# The estimated library average fragment size is 280, the read length is 60bp, so the inner distange between paired reads (--mate-inner-dist) is 160.
# The estimated standard deviation for the distribution on inner distances between mate pairs is 60bp. The default value is 20bp.
# --keep-tmp will keep the temporary files including bowtie output files.
# mm9_all_folded_with_chrID is the mouse genome reference, which is kept in the bowtie index folder.
#
${LANE}_1_sequence.txt is the same as s_1_1_sequence.txt because I set the value of "LANE" to "s_1". This file is forward reads.
# ${LANE}_2_sequence.txt is the same as s_1_2_sequence.txt because I set the value of "LANE" to "s_1". This file is reverse reads.
 
/gpfs/runtime/bioinfo/tophat-1.1.0.Linux_x86_64/tophat --num-threads 5 --mate-inner-dist 160 \
    --mate-std-dev 60 --solexa1.3-quals --max-multihits 10  --coverage-search \
    --microexon-search -o tophat_out-$LANE  --keep-tmp -G /gpfs/runtime/bioinfo/aceviewdb/aceview_annotation_mm9_with_chrID.gtf mm9_all_folded_with_chrID \
     ${LANE}_1_sequence.txt ${LANE}_2_sequence.txt && notice_me


#2-1) get high level summary on the alignment. You can ignore this step if you don't need the high level statistics.

# In last step, I used --keep-tmp when I call tophat and kept bowtie output files. Here I can summarize the mapping statistics: how many read in total, how many aligned, how many aligned to exon and so on.
/gpfs/runtime/bioinfo/bin/summarize_bowtie_mapping_result.pl /gpfs/runtime/bioinfo/aceviewdb/aceview_annotation_mm9_with_chrID.gtf  \
        ${LANE}_1_sequence.txt tophat_out-$LANE/tmp/left_kept_reads.bwtout && notice_me


# Because the script can run for a while, you can try the script by using option 'try-10000-only' to read only 10000 exons and only 10000 alignments. Running the script without options will give you the details:
/gpfs/runtime/bioinfo/bin/summarize_bowtie_mapping_result.pl

# Do the same thing on read2:
/gpfs/runtime/bioinfo/bin/summarize_bowtie_mapping_result.pl /gpfs/runtime/bioinfo/aceviewdb/aceview_annotation_mm9_with_chrID.gtf  \
        ${LANE}_2_sequence.txt tophat_out-$LANE/tmp/right_kept_reads.bwtout && notice_me


#3) run cufflinks to estimate expression level of identified genes and alternative splicing variants

# This needs a few minutes
cd tophat_out-$LANE
/gpfs/runtime/bioinfo/cufflinks-0.9.1.Linux_x86_64/cufflinks -p 5 --inner-dist-mean 160 -s 60 \
    -r /gpfs/runtime/bioinfo/bowtie/indexes/mm9_all_folded_with_chrID.fa accepted_hits.bam
&& notice_me
cd ..

#4) run cuffcompare to compare your RNA-Seq sample with Aceview gene annotation database

# This will need a few minuts
/gpfs/runtime/bioinfo/
cufflinks-0.9.1.Linux_x86_64/cuffcompare  -r /gpfs/runtime/bioinfo/aceviewdb/aceview_annotation_mm9_with_chrID.gtf -R -V \
    -o cuffcom-out-$LANE  tophat_out-$LANE/transcripts.gtf 
&& notice_me

#5) calculate expression level for known Aceview gene and alternative splicing variants

#  This will need a few minutes
#  This will create an excel file s_1_expression_of_known_gene_&_alternative_splicing.xls' with two sheets:
#  One for gene expression and the other for alternative splicing variant expression level.

/gpfs/runtime/bioinfo/bin/find_known_gene_and_alternative_splicing_expression_from_cuffcompare_result.pl tophat_out-$LANE/transcripts.tmap && notice_me

# Open the Excel table to see if your favorite gene is there. And click the link in the first column to open AceView genome browser to look at the gene.
# The two temporary files named "tem_xxxxxxxxx" are also created for later merging with other lanes. You can ignore them for now. 

#6) merge results from multiple lanes for the known gene and alternative transcripts

# To create Excel tables for known genes, known alternative splicing, for multiple lanes (e.g. s_1 and s_2 below)
# You need to make sure that the two lanes are analyzed (from step 2 to step 6).

/gpfs/runtime/bioinfo/bin/merge_gene_and_alternative_splicing_for_multiple_lanes.pl known_only s_1 s_2

# This will create two Excel files. Open them to check the results: 
# s_1-s_2_expression_of_known_gene_&_alternative_splicing.xls

#7) find novel alternative splicing variants which related to known genes

# Convert accepted.bam to accepted.sam for later query usage
/gpfs/runtime/bioinfo/samtools-0.1.8/misc/samtools view -h -o tophat_out-$LANE/accepted_hits.sam tophat_out-$LANE/accepted_hits.bam && notice_me

# Find novel transcript with related genes, and express more than 100. This will need about half an hour.
/gpfs/runtime/bioinfo/bin/find_novel_alternative_splicing_expression_from_cuffcompare_result_wig.pl tophat_out-$LANE/transcripts.tmap 100  && notice_me

# This will create five files:

# 1)    A gtf file which includes the structure of the novel transcripts. The file name is:  s_1_novel_scripts_threshold_100.gtf

#       Click the following link to open UCSC genome browser:
#       http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammal&org=Mouse&db=mm9&Submit=submit   

#       Upload the gtf file as a custom track, turn on Aceview track and hide other tracks you are not interested in.

# 2)    A bam file s_1_novel_transcripts_related_region_threshold_100.sorted.bam, which contains alignment information for the genome regions related to the new transcripts.


# 3)    A index file for the bam file: s_1_novel_transcripts_related_region_threshold_100.sorted.bam.bai


#         Upload the bam file and the index file as custom track as discribed in
here (need http or ftp server, I didn't find a good free server yet.). 

# 4)    An excel file expression_novel_transcripts.xls, which contains the expression level, related gene name and range on the chromosome.
#        Open the Excel table to see if your favorate gene is there. Click the link in the table to open UCSC genome browser to look at the region on genome.


#        Now, you can compare you novel transcript with the known AceView transcripts to see the difference.

#        Also you can check the bam track to make sure there are enough reads to support your novel transcript, especial the unique part in the novel transcript.  

# 5)    A wig file which can be uploaded to UCSC web browser for you to review the results. The wig file does not need a ftp server which is different from the bam file.

# Note: A temporary file named "tem_xxxxxxxxx" is also created for later merging with other lanes. You can ignore it for now.   


#8) merge results from multiple lanes for novel transcripts

# To create Excel tables for known genes, known alternative splicing, and novel alternative splicing, run this command with the lane IDs (e.g. s_1 and s_2 below)
# You need to make sure that the two lanes are analyzed (from step 2 to step 6).

/gpfs/runtime/bioinfo/bin/merge_gene_and_alternative_splicing_for_multiple_lanes.pl known_only s_1 s_2

# This will create two Excel files. Open them to check the results: 
# s_1-s_2_expression_novel_transcipts.xls


Note: step 6 and step 8 can be done all together by:
/gpfs/runtime/bioinfo/bin/merge_gene_and_alternative_splicing_for_multiple_lanes.pl all s_1 s_2

#9) Identify SNPs and indels for whole transcriptome

# Sort the bam file first
samtools sort tophat_out-$LANE/accepted_hits.bam tophat_out-$LANE/accepted_hits.sorted
&& notice_me

# Identify raw candidate SNPs and indels
samtools pileup -vcf  /gpfs/runtime/bioinfo/bowtie/indexes/mm9_all_folded_with_chrID.fa  tophat_out-$LANE/accepted_hits.sorted.bam > ${LANE}_raw_all.pileup
&& notice_me

# Query to get high quality SNPs and indels. 20 is threshold for the the minimum quality score.
samtools.pl varFilter
${LANE}_raw_all.pileup | awk '$6>=20' > ${LANE}_final_all.pileup 
&& notice_me
# Let me know if you need help to query the results by other criteria. Please click here.
# For detail about the meaning of each column please visit: http://samtools.sourceforge.net/cns0.shtml

#9) Identify SNPs and indels with in the novel transcripts' region on chromosome

# Identify raw candidate SNPs and indels (Note: the bam file is sorted in step 6).
samtools pileup -vcf  /gpfs/runtime/bioinfo/bowtie/indexes/mm9_all_folded_with_chrID.fa ${LANE}_novel_transcripts_related_region_threshold_100.bam  >
${LANE}_raw_novel.pileup && notice_me

# Query to get high quality SNPs and indels. 20 is threshold for the the minimum quality score.
samtools.pl varFilter
${LANE}_raw_novel.pileup | awk '$6>=20' > ${LANE}_final_novel.pileup && notice_me  

# Let me know if you need help to query the results by other criteria. Please click here.
# For detail about the meaning of each column please visit: http://samtools.sourceforge.net/cns0.shtml

# Let me know if you have any question.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Read more here


What is TopHat?


TopHat
is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.

What types of reads can I use TopHat with?

TopHat was designed to work with reads produced by the Illumina Genome Analyzer, although users have been successful in using TopHat with reads from other technologies. In TopHat 1.1.0, we began supporting Applied Biosystems' Colorspace format. The software is optimized for reads 75bp or longer.

Currently, TopHat does not allow short (fewer than a few nucleotides) insertions and deletions in the alignments it reports. Support for insertions and deletions will eventually be added.

Finally, mixing paired- and single- end reads together is not supported.

How does TopHat find junctions?

TopHat finds splice junctions without a reference annotation. By first mapping RNA-Seq reads to the genome, TopHat identifies potential exons, since many RNA-Seq reads will contiguously align to the genome. Using this initial mapping, TopHat builds a database of possible splice junctions, and then maps the reads against this junction to confirm them.

Short read sequencing machines can currently produce reads 100bp or longer, but many exons are shorter than this, and so would be missed in the initial mapping. TopHat solves this problem by splitting all input reads into smaller segments, and then mapping them independently. The segment alignments are "glued" back together in a final step of the program to produce the end-to-end read alignments.

TopHat generates its database of possible splice junctions from three sources of evidence. The first source is pairings of "coverage islands", which are distinct regions of piled up reads in the initial mapping. Neighboring islands are often spliced together in the transcriptome, so TopHat looks for ways to join these with an intron. The second source is only used when TopHat is run with paired end reads. When reads in a pair come from different exons of a transcript, they will generally be mapped far apart in the genome coordinate space. When this happens, TopHat tries to "close" the gap between them by looking for subsequences of the genomic interval between mates with a total length about equal to the expected distance between mates. The "introns" in this subsequence are added to the database. The third, and strongest, source of evidence for a splice junction is when two segments from the same read are mapped far apart, or when an internal segment fails to map. With long (>=75bp) reads, "GT-AG", "GC-AG" and "AT-AC" introns be found ab initio. With shorter reads, TopHat only reports alignments across "GT-AG" introns

If you want to know the details, please check the project home page: http://tophat.cbcb.umd.edu/.   Tophat user manual can be found in here: http://tophat.cbcb.umd.edu/manual.html

What is Cufflinks?

Cufflinks takes a text file of SAM alignments as input. The RNA-Seq read mapper TopHat produces output in this format, and is recommended for use with Cufflinks.

Cufflinks
assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. It accepts aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one.

How does Cufflinks assemble transcripts?

Cufflinks constructs a parsimonious set of transcripts that "explain" the reads observed in an RNA-Seq experiment. It does so by reducing the comparative assembly problem to a problem in maximum matching in bipartite graphs. In essence, Cufflinks implements a constructive proof of Dilworth's Theorem by constructing a covering relation on the read alignments, and finding a minimum path cover on the directed acyclic graph for the relation.

While Cufflinks works well with unpaired RNA-Seq reads, it is designed with paired reads in mind. The assembly algorithm explicitly handles paired end reads by treating the alignment for a given pair as a single object in the covering relation. The proof of Dilworth's theorem finds a maximum cardinality matching on the bipartite graph of the transitive closure of the DAG. However, there is not necessarily a unique maximum cardinality matching, reflecting the fact that due to the limited size of RNA-Seq cDNA fragments, we may not know with certainty which outcomes of alternative splicing events go together in the same transcripts. Cufflinks tries to find the correct parsimonious set of transcripts by performing a minimum cost maximum matching. The cost of associating splicing events is based on the "percent-spliced-in" score introduced in

This matching is then extended to a minimum cost path cover of the DAG, with each path representing a different transcript.

The algorithm builds on ideas behind the ShoRAH algorithm for haplotype abundance estimation in viral populations, described in:

The assembler also borrows some ideas introduced with the PASA algorithm for annotating genomes from EST and full length mRNA evidence, described in:
  • Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith Jr, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res, 31, 5654-5666.

Cufflinks is implemented in C++ and makes substantial use of the Boost Libraries as well as the LEMON Graph Library, which was launched by the Egerváry Research Group on Combinatorial Optimization (EGRES).

How does Cufflinks calculate transcript abundances?

In RNA-Seq experiments, cDNA fragments are sequenced and mapped back to genes and ideally, individual transcripts. Properly normalized, the RNA-Seq fragment counts can be used as a measure of relative abundance of transcripts, and Cufflinks measures transcript abundances in Fragments Per Kilobase of exon per Million fragments mapped (FPKM), which is analagous to single-read "RPKM", originally proposed in:

In paired-end RNA-Seq experiments, fragments are sequenced from both ends, providing two reads for each fragment. To estimate isoform-level abundances, one must assign fragments to individual transcripts, which may be difficult because a read may align to multiple isoforms of the same gene. Cufflinks uses a statistical model of paired-end sequencing experiments to derive a likelihood for the abundances of a set of transcripts given a set of fragments. This likelihood function can be shown to have a unique maximum, which Cufflinks finds using a numerical optimization algorithm. The program then multiplies these probabilities to compute the overall likelihood that one would observe the fragments in the experiment, given the proposed abundances on the transcripts. Because Cufflinks' statistical model is linear, the likelihood function has a unique maximum value, and Cufflinks finds it with a numerical optimization algorithm.

When some isoforms have much lower abundance than another from the same gene, or when the number of reads for the locus is very small, this numerical procedure is less accurate. To place confidences on the reported abundances, we have adapted the Bayesian inference procedure proposed by Hui Jiang and Wing Wong for single read RNA-seq in

Using these statistical methods, Cufflinks can estimate the abundances of the isoforms present in the sample, either using a known "reference" annotation, or after an ab-initio assembly of the transcripts using only the reference genome.

If you want to know the details, please check the project home page: http://cufflinks.cbcb.umd.edu/index.html. Cufflinks Manual is here:  http://cufflinks.cbcb.umd.edu/manual.html