TopHat

TopHat [1], a fast splice junction mapper for RNA-Seq reads, aligns the reads to mammalian-sized genomes using Bowtie and analyzes the mapping results to identify splice junctions between exons.

Important Notes

TopHat can use multiple processors on the same node to improve the performance of your job. For the number of processors available on each node type, see HPC Resource View.
TopHat can be memory intensive. You may want to request memory explicitly according to your needs. See the "Parallel Jobs" section.

Installed Versions

All available versions of TopHat can be listed with the following command. The same approach works for other applications.

module avail tophat

output:

----------------------------------------------- /usr/local/share/modulefiles ------------------------------------------------

tophat/2.1.1

The default module can be loaded as:

module load tophat

Other versions of TopHat can be loaded as:

module load tophat/<version>

Running TopHat in HPC

Copy the test_data directory from /usr/local/doc/TOPHAT and change into it:

cp -r /usr/local/doc/TOPHAT/test_data .

cd test_data

Interactive Job Submission

Request a compute node:

srun --pty bash

Then, load the tophat module:

module load tophat

Now run your tophat command:

tophat -r 20 test_ref reads_1.fq reads_2.fq

output:

[2017-08-23 17:10:15] Beginning TopHat run (v2.1.0)

-----------------------------------------------

[2017-08-23 17:10:15] Checking for Bowtie

Bowtie 2 not found, checking for older version..

Bowtie version: 1.1.2.0

...

[2017-08-23 17:10:19] Mapping right_kept_reads_seg3 to genome segment_juncs with Bowtie (3/3)

[2017-08-23 17:10:19] Joining segment hits

[2017-08-23 17:10:19] Reporting output tracks

-----------------------------------------------

[2017-08-23 17:10:19] A summary of the alignment counts can be found in ./tophat_out/align_summary.txt

[2017-08-23 17:10:19] Run complete: 00:00:03 elapsed

Batch Job Submission

Serial Jobs

Find the job script file "serial.slurm" in test_data.

Submit the script:

sbatch serial.slurm

When the job finishes, the output directory "tophat_out" should contain .bam files.
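
One way to confirm the result is a small shell helper that lists the .bam files and prints the alignment summary named in the TopHat log. This is only a sketch: the function name check_tophat_out is made up for illustration and is not part of test_data.

```shell
# Hypothetical helper: summarize a TopHat output directory.
# The directory defaults to tophat_out, the name used above.
check_tophat_out() {
    out_dir="${1:-tophat_out}"
    if [ ! -d "$out_dir" ]; then
        echo "missing: $out_dir"
        return 1
    fi
    # List BAM files; say so if none were produced.
    ls "$out_dir"/*.bam 2>/dev/null || echo "no .bam files in $out_dir"
    # align_summary.txt is the file mentioned at the end of the TopHat log.
    if [ -f "$out_dir/align_summary.txt" ]; then
        cat "$out_dir/align_summary.txt"
    fi
}
```

Call it as "check_tophat_out tophat_out" from the directory where the job ran. A non-zero return means the directory does not exist yet, for example because the job is still queued.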

Parallel Jobs

Find the job script file "parallel.slurm" in test_data; its content is shown below. You can change the values of "n" and "mem" to suit your job. For the number of processors (n) available on each node type, see HPC Resource View.

#!/bin/bash

#SBATCH -N 1 -n 12 --mem=16gb

#SBATCH --time=10:00:00

# Load modules

module load tophat

NPROCS=$SLURM_NTASKS

# Run tophat Parallel

tophat -p $NPROCS -o SRR039999_1_par mm9 SRR039999_1.fastq
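
If the script might also be run by hand outside Slurm, SLURM_NTASKS will be unset and "-p" would receive an empty value. A possible guard, shown here as a sketch and not as part of the shipped parallel.slurm, is to fall back to one thread:

```shell
# Use the Slurm task count when it is set, otherwise fall back to 1 thread.
NPROCS="${SLURM_NTASKS:-1}"

# Print the command that would be run; drop the echo and quotes to execute it.
echo "tophat -p $NPROCS -o SRR039999_1_par mm9 SRR039999_1.fastq"
```

Under sbatch with "-n 12" this prints the same command the script above runs, with "-p 12".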

Submit the job:

sbatch parallel.slurm
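
To keep track of the job after submission, the job ID can be captured and passed to squeue. The helper below is a sketch; the name submit_and_watch is hypothetical, and it assumes the standard Slurm commands sbatch and squeue are on your PATH.

```shell
# Hypothetical helper: submit a job script and show its queue entry.
submit_and_watch() {
    script="$1"
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found; run this on the cluster" >&2
        return 1
    fi
    # --parsable makes sbatch print only the job ID,
    # optionally followed by ";cluster-name".
    jobid=$(sbatch --parsable "$script") || return 1
    jobid=${jobid%%;*}
    echo "Submitted job $jobid"
    squeue -j "$jobid"
}
```

For example, "submit_and_watch parallel.slurm" submits the script above and prints its current state in the queue.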

Note that with 1 processor (n=1) the job took about 4 hr 25 min, whereas with 12 processors (n=12) it took only 1 hr 15 min.
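
As a rough reading of these timings, the speedup and parallel efficiency can be worked out in a few lines of shell. This is only an arithmetic sketch; the variable names are illustrative.

```shell
# Timings quoted above, converted to minutes.
serial_min=$((4 * 60 + 25))    # 4 hr 25 min with 1 processor
parallel_min=$((1 * 60 + 15))  # 1 hr 15 min with 12 processors
nprocs=12

# Speedup = serial time / parallel time.
speedup=$(awk -v s="$serial_min" -v p="$parallel_min" \
    'BEGIN { printf "%.2f", s / p }')

# Efficiency = speedup / number of processors, as a percentage.
efficiency=$(awk -v s="$serial_min" -v p="$parallel_min" -v n="$nprocs" \
    'BEGIN { printf "%.1f", 100 * s / (p * n) }')

echo "speedup=${speedup}x efficiency=${efficiency}%"
# prints: speedup=3.53x efficiency=29.4%
```

The speedup is well short of 12x, so requesting more processors gives diminishing returns; for your own data it may be worth timing a few values of "n" before settling on one.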

References:

[1] TopHat Home: http://tophat.cbcb.umd.edu/

[2] Tutorial: http://www.broadinstitute.org/software/scripture/Walkthrough_example

[3] FASTQ repository: ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA012/SRA012498/SRX019275/

[4] Mouse reference: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml