BLAST by qsub

This tutorial shows how to divide your sequences into smaller files and submit the BLAST jobs to the Oscar computer cluster, so that the mapping finishes faster than running it on a single machine. After making sure all BLAST jobs finish successfully, we will merge the output files together for downstream data analysis.
Here, I will run blastx to map my contigs from an Illumina sequencing assembly (saved as /gpfs/data/blastdb/example_contig.fasta on Oscar) against the nr protein database.

Note: this script is based on the original script by Mark: /gpfs/scratch/mhowison/Patrick/multiblastx.py

!!! Note: make sure to use the correct genetic code (option: -query_gencode) when running blastx; the default is 1 (the standard code).

# First, log in to Oscar and go to scratch
ssh oscar
cd scratch/

# Load the biomed module. This sets the path for the python script
module load biomed

# Take a look at the usage information:
blast-by-qsub-v1.1.py
usage: blast-by-qsub-v1.1.py    query.fa    num_seq_per_job    blast_program    db    number-of-nodes-to-use    wall_time    email_address

# Divide my contigs into smaller files of 200 contigs each, and create a PBS qsub script for blastx against the nr protein database
blast-by-qsub-v1.1.py    /gpfs/data/blastdb/example_contig.fasta            200                   blastx           nr     12       3:00:00          xxx@brown.edu

# Here is the output:
To submit the jobs, run the following command:
qsub example_contig.fasta_nr.pbs
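Under the hood, the script writes the chunks into tmp/ with names like example_contig.fasta_0.fa, which is what the -query lines in the generated PBS script point at. The splitting step can be sketched with awk; this is a hypothetical stand-in for what blast-by-qsub-v1.1.py does internally, and the tiny demo input and chunk size of 2 are made up for illustration (the real run uses your contig file and n=200):

```shell
# demo input: a tiny FASTA standing in for the real contigs file
printf '>s1\nACGT\n>s2\nGGCC\n>s3\nTTAA\n' > demo.fa

# split demo.fa into tmp/demo.fa_<i>.fa, n records per chunk
n=2
mkdir -p tmp
awk -v n="$n" -v prefix="tmp/demo.fa_" '
    /^>/ { if (count % n == 0) { if (file) close(file)
                                 file = prefix int(count / n) ".fa" }
           count++ }
    { print > file }
' demo.fa
```

This produces tmp/demo.fa_0.fa (s1 and s2) and tmp/demo.fa_1.fa (s3), mirroring the tmp/example_contig.fasta_$PBS_ARRAYID.fa files that the PBS array indexes.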

# Take a look at the PBS qsub script. You can edit it to change how the blast is run; refer to the BLAST manual for the options.
less example_contig.fasta_nr.pbs

#!/bin/bash
#PBS -N example_contig.fasta
#PBS -t 0-4%12
#PBS -l nodes=1
#PBS -l walltime=3:00:00
#PBS -j oe
#PBS -M xxx@brown.edu
module load blast

cd $PBS_O_WORKDIR
echo done > tmp/example_contig.fasta_nr_$PBS_ARRAYID.out.start
blastx -db nr \
       -query tmp/example_contig.fasta_$PBS_ARRAYID.fa \
       -out tmp/example_contig.fasta_nr_$PBS_ARRAYID.out \
       -outfmt 6 \
       -num_threads 8 \
       -num_alignments 10
echo done > tmp/example_contig.fasta_nr_$PBS_ARRAYID.out.start.done

# It is a good idea to run 1 or 2 jobs before submitting the whole thing to the cluster, in case something is wrong, such as the walltime, the blast output format, filtering and so on.
# So I make a copy of the script and edit the copy to run only jobs 0 and 1 to see what happens. Note: the only change is the -t line (0-1 instead of 0-4).
cp example_contig.fasta_nr.pbs example_contig.fasta_nr.pbs1
vi  example_contig.fasta_nr.pbs1
#!/bin/bash
#PBS -N example_contig.fasta
#PBS -t 0-1%12
#PBS -l nodes=1
#PBS -l walltime=3:00:00
#PBS -j oe
#PBS -M xxx@brown.edu
module load blast

cd $PBS_O_WORKDIR
echo done > tmp/example_contig.fasta_nr_$PBS_ARRAYID.out.start
blastx -db nr \
       -query tmp/example_contig.fasta_$PBS_ARRAYID.fa \
       -out tmp/example_contig.fasta_nr_$PBS_ARRAYID.out \
       -outfmt 6 \
       -num_threads 8 \
       -num_alignments 10
echo done > tmp/example_contig.fasta_nr_$PBS_ARRAYID.out.start.done

# After editing the script, press Esc and type :wq to save it. Then submit the two jobs to the cluster:
qsub example_contig.fasta_nr.pbs1
1372422[].mgt

# Check the job status:
showq -u $USER

# You should receive an email when the jobs finish.
# Check whether the jobs finished successfully
ls -l tmp/*done
-rw-r----- 1 ldong ccvstaff 5 Dec  2 14:20 tmp/example_contig.fasta_nr_0.out.start.done
-rw-r----- 1 ldong ccvstaff 5 Dec  2 14:20 tmp/example_contig.fasta_nr_1.out.start.done

# The two jobs finished. If you have many jobs and counting by eye is hard, you can count the finished jobs with:
ls -l tmp/*done | wc -l
2
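The .start/.done pair written by each job can also tell you which jobs started but never finished (for example, killed by the walltime limit or crashed mid-run). A minimal sketch, using made-up marker files for illustration:

```shell
# demo markers: job 0 finished; job 1 started but never wrote its .done file
mkdir -p tmp
touch tmp/example_contig.fasta_nr_0.out.start \
      tmp/example_contig.fasta_nr_0.out.start.done \
      tmp/example_contig.fasta_nr_1.out.start

# report any job that started but did not finish
for f in tmp/*.out.start; do
    [ -e "$f.done" ] || echo "unfinished: $f"
done
```

Any index it prints is a job you should resubmit (possibly with a longer walltime).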

# You can see the actual time used by the job in the job output file:
less example_contig.fasta.o1372422-0
----------------------------------------
Begin PBS Prologue Fri Dec  2 15:07:24 EST 2011 1322856444
Job ID:         1372422[0].mgt
Username:       ldong
Group:          illumina
Nodes:          node154
End PBS Prologue Fri Dec  2 15:07:25 EST 2011 1322856445
----------------------------------------
----------------------------------------
Begin PBS Epilogue Fri Dec  2 15:43:41 EST 2011 1322858621
Job ID:         1372422[0].mgt
Username:       ldong
Group:          illumina
Job Name:       example_contig.fasta-0
Session:        23863
Limits:         neednodes=1,nodes=1,walltime=03:00:00
Resources:      cput=03:59:08,mem=5216848kb,vmem=11583308kb,walltime=00:36:12
Queue:          dq
Account:                
Nodes:  node154
Killing leftovers...

End PBS Epilogue Fri Dec  2 15:43:41 EST 2011 1322858621
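With many job output files, the used walltime can be pulled out of each epilogue with grep: the Limits line holds the requested walltime and the Resources line the used one, so taking the last walltime= match gives the time actually used. The log fragment below is a made-up stand-in for the real .o files:

```shell
# demo epilogue fragment standing in for example_contig.fasta.o1372422-0
cat > demo.o <<'EOF'
Limits:         neednodes=1,nodes=1,walltime=03:00:00
Resources:      cput=03:59:08,mem=5216848kb,vmem=11583308kb,walltime=00:36:12
EOF

# last match = walltime actually used (from the Resources line)
grep -o 'walltime=[0-9:]*' demo.o | tail -1
```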

# So the first job used only about 36 minutes (walltime=00:36:12 in the epilogue), so 1:30:00 should be enough instead of 3 hours.
# Now I will change the script and submit the remaining jobs. Note: the changes are the -t range (2-4) and the walltime (1:30:00).
cp example_contig.fasta_nr.pbs example_contig.fasta_nr.pbs2
vi  example_contig.fasta_nr.pbs2
#!/bin/bash
#PBS -N example_contig.fasta
#PBS -t 2-4%12
#PBS -l nodes=1
#PBS -l walltime=1:30:00
#PBS -j oe
#PBS -M xxx@brown.edu
module load blast

cd $PBS_O_WORKDIR
echo done > tmp/example_contig.fasta_nr_$PBS_ARRAYID.out.start
blastx -db nr \
       -query tmp/example_contig.fasta_$PBS_ARRAYID.fa \
       -out tmp/example_contig.fasta_nr_$PBS_ARRAYID.out \
       -outfmt 6 \
       -num_threads 8 \
       -num_alignments 10
echo done > tmp/example_contig.fasta_nr_$PBS_ARRAYID.out.start.done

# now, submit the jobs:
qsub example_contig.fasta_nr.pbs2

# Check if the jobs finished
ls -l tmp/*done | wc -l
5

# Combine the blast results into a single file
cat tmp/*.out > example_contig_blastx_nr.blx
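As a quick sanity check on the merged file: outfmt 6 is tab-separated with the query ID in column 1, so you can count how many distinct contigs got at least one hit. The demo file below is made up for illustration:

```shell
# demo merged output: two hits for contig1, one for contig2 (tab-separated outfmt 6)
printf 'contig1\tgi|1\t98.0\ncontig1\tgi|2\t90.1\ncontig2\tgi|3\t85.5\n' > demo.blx

# number of distinct query contigs with at least one hit
cut -f1 demo.blx | sort -u | wc -l
```

Comparing this count against the number of contigs in the original FASTA tells you how many contigs had no hit at all.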

Let me know if you have any questions.


