TopHat
TopHat [1], a fast splice junction mapper for RNA-Seq reads, aligns the reads to mammalian-sized genomes using Bowtie and analyzes the mapping results to identify splice junctions between exons.
Important Notes
TopHat can use multiple processors on the same node to improve the performance of your job. To see how many processors each type of node has, view HPC Resource View.
TopHat can be memory intensive, so you may want to request memory explicitly according to your needs. See the "Parallel Jobs" section.
Installed Versions
To list all installed versions of TopHat, issue the following command. The same approach works for other applications.
module avail tophat
output:
----------------------------------------------- /usr/local/share/modulefiles ------------------------------------------------
tophat/2.1.1
The default module can be loaded as:
module load tophat
Other versions of TopHat can be loaded as:
module load tophat/<version>
Running TopHat in HPC
Copy the directory test_data from /usr/local/doc/TOPHAT and cd into it:
cp -r /usr/local/doc/TOPHAT/test_data .
cd test_data
Interactive Job Submission
Request a compute node:
srun --pty bash
Then, load the tophat module:
module load tophat
Now, run your tophat command:
tophat -r 20 test_ref reads_1.fq reads_2.fq
output:
[2017-08-23 17:10:15] Beginning TopHat run (v2.1.0)
-----------------------------------------------
[2017-08-23 17:10:15] Checking for Bowtie
Bowtie 2 not found, checking for older version..
Bowtie version: 1.1.2.0
...
[2017-08-23 17:10:19] Mapping right_kept_reads_seg3 to genome segment_juncs with Bowtie (3/3)
[2017-08-23 17:10:19] Joining segment hits
[2017-08-23 17:10:19] Reporting output tracks
-----------------------------------------------
[2017-08-23 17:10:19] A summary of the alignment counts can be found in ./tophat_out/align_summary.txt
[2017-08-23 17:10:19] Run complete: 00:00:03 elapsed
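The `-r 20` option in the command above sets the expected mean inner distance between mate pairs (`--mate-inner-dist`), that is, the fragment length minus the two read lengths. A minimal sketch of that arithmetic follows; the fragment and read lengths are illustrative values, not measurements of the reads in test_data.

```shell
# Illustrative library parameters (assumptions, not taken from test_data)
fragment_len=300   # size-selected fragment length in bp
read_len=50        # length of each mate in bp

# Inner distance = fragment length minus both reads
inner_dist=$(( fragment_len - 2 * read_len ))

echo "Use: tophat -r $inner_dist <index> <reads_1> <reads_2>"
```

For your own data, replace the two values with those of your library preparation.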
Batch Job Submission
Serial Jobs
Find the job script file "serial.slurm" in test_data.
Submit the script:
sbatch serial.slurm
You should get the output directory "tophat_out" with .bam files.
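One way to confirm that the run produced its results is to check the output directory for the files TopHat 2 normally writes. This is a minimal sketch; the function name `check_tophat_out` is hypothetical, and the file list covers only three of the usual outputs.

```shell
# Hypothetical helper: report which common TopHat 2 output files exist.
# Returns the number of missing files (0 means all were found).
check_tophat_out() {
    local outdir="${1:-tophat_out}"
    local missing=0
    for f in accepted_hits.bam junctions.bed align_summary.txt; do
        if [ -e "$outdir/$f" ]; then
            echo "found:   $outdir/$f"
        else
            echo "missing: $outdir/$f"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Check the default output directory of the serial job
check_tophat_out tophat_out || echo "Some expected files are absent."
```

If the job was run with `-o <dir>`, pass that directory name instead of tophat_out.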
Parallel Jobs
Find the job script file "parallel.slurm" in test_data; its content is shown below. You can change the values of "-n" and "--mem" to match your job's requirements. To see how many processors (n) each type of node has, view HPC Resource View.
#!/bin/bash
#SBATCH -N 1 -n 12 --mem=16gb
#SBATCH --time=10:00:00
# Load modules
module load tophat
NPROCS=$SLURM_NTASKS
# Run tophat Parallel
tophat -p $NPROCS -o SRR039999_1_par mm9 SRR039999_1.fastq
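The script above takes its thread count from SLURM_NTASKS, which Slurm sets only inside a job. If you also want to try the same commands outside Slurm, a default value avoids passing an empty `-p` argument. This is a minimal sketch; the helper name `nprocs_from_slurm` is hypothetical.

```shell
# Hypothetical helper: print the Slurm task count, or 1 when not inside a Slurm job.
nprocs_from_slurm() {
    echo "${SLURM_NTASKS:-1}"
}

NPROCS=$(nprocs_from_slurm)
echo "TopHat would run with -p $NPROCS"
```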
Submit the job:
sbatch parallel.slurm
Note that with 1 processor (-n 1) the job took about 4 hours and 25 minutes, whereas with 12 processors (-n 12) it took only 1 hour and 15 minutes.
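The timings quoted above can be turned into a speedup and a parallel efficiency, which helps when deciding how many processors to request. This sketch uses only the two run times from the note; the variable names are illustrative.

```shell
# Run times from the note above, converted to minutes
serial_min=$(( 4 * 60 + 25 ))     # 1 processor
parallel_min=$(( 1 * 60 + 15 ))   # 12 processors
cores=12

# Speedup = serial time / parallel time
speedup=$(awk -v s="$serial_min" -v p="$parallel_min" \
    'BEGIN { printf "%.2f", s / p }')

# Parallel efficiency = speedup / number of processors, as a percentage
efficiency=$(awk -v s="$serial_min" -v p="$parallel_min" -v n="$cores" \
    'BEGIN { printf "%.0f", 100 * s / (p * n) }')

echo "speedup: ${speedup}x, parallel efficiency: ${efficiency}%"
# prints: speedup: 3.53x, parallel efficiency: 29%
```

Twelve processors gave roughly a 3.5-fold speedup for this job, so requesting fewer processors may shorten queue wait with only a modest cost in run time.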
References:
[1] TopHat Home: http://tophat.cbcb.umd.edu/
[2] Tutorial: http://www.broadinstitute.org/software/scripture/Walkthrough_example
[3] FASTQ repository: ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA012/SRA012498/SRX019275/
[4] Mouse reference: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml