Casava 1.7 for RNA-Seq (cluster version)

   


The CASAVA 1.7 pipeline software provides two options for running alignments with the ELAND aligner:

  • GERALD.pl (pink part in the above figure): aligns non-bar-coded reads to a reference sequence.
  • demultiplexer.pl (green part in the above figure; note that in CASAVA 1.7 this involves two steps, demultiplexer.pl and demultiplexedGERALD.pl): separates bar-coded reads into different bins, then aligns them.

GERALD.pl: You can run GERALD.pl through the PBS scheduler, or you can run the software on an SMP interactive node.

# 1) Run GERALD.pl on example data on the Oscar computer cluster:


#log in to oscar
ssh oscar

#make and enter a temporary directory for testing. Note that files in the scratch folder that are more than 4 weeks old are deleted automatically
mkdir scratch/tem_test && cd scratch/tem_test

#take a look at the test data set
ls /gpfs/runtime/bioinfo/casava1.7_data_script/Illumina_Genome_Analyzer_Validation_Dataset_v_1_5_0/071112_EAS1_0089_FC20120_R1\
/Data/C2-37,39-74_Firecrest1.5.0_07-10-2009_craczy/Bustard1.5.0_07-10-2009_craczy


#copy the PBS job script
cp /gpfs/runtime/bioinfo/casava1.7_data_script/pbs_gerald_batch_rna-seq.script .

#copy the GERALD config file
cp /gpfs/runtime/bioinfo/casava1.7_data_script/gerald_config_rna-seq.txt .

#submit the job
qsub pbs_gerald_batch_rna-seq.script

#check job status
showq -w user=$USER

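#When the job finishes, it creates a combined output and error file, "pbs_gerald.out". The file name comes from the "#PBS -o" line in the job script, so adjust the command below if it differs. Look at the end of the file and check for errors.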
tail -n 30 gerald.out
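
Looking at the last 30 lines can miss an error reported earlier in a long log. The following is a minimal sketch of a more thorough check, not part of CASAVA: the function name check_job_log is hypothetical, and it assumes that problems show up as lines containing the word "error" in any capitalisation, which may not cover every failure.

```shell
# Hypothetical helper (not part of CASAVA): scan a PBS output file for lines
# that mention "error", case-insensitively, and print them with line numbers.
# Returns 0 when the file exists and no such line is found, 1 otherwise.
check_job_log() {
    log_file=$1
    if [ ! -f "$log_file" ]; then
        echo "log file not found: $log_file" >&2
        return 1
    fi
    # grep exits with 0 when at least one line matched, i.e. a problem was found
    if grep -n -i 'error' "$log_file"; then
        return 1
    fi
    return 0
}
```

After defining the function, call it with the name of your output file, for example: check_job_log gerald.out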

#Note: two programs, "gnuplot" and "xsltproc", are not available on our cluster nodes, so some plots and html files will be missing from the folder. You will need to create them yourself with the commands below.
#If the plots and html files are not essential for your application, you can skip this step.
cd GERALD_$(date '+%d-%m-%Y%n')_$USER
/users/ldong/bio/bin/make_error_plot_and_rerun_make_for_gerald.py
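
The cd command above builds the folder name from today's date, so it fails if the job ran on an earlier day. As a hedged alternative, the sketch below picks the most recently modified folder that matches GERALD_*_your-user-id. The function name latest_gerald_dir is hypothetical, and the sketch assumes folder names without newlines, since it reads the output of ls.

```shell
# Hypothetical helper: print the most recently modified GERALD_*_<user>
# folder under a parent directory. Returns 1 when no such folder exists.
latest_gerald_dir() {
    parent=$1
    user=$2
    # ls -td lists the matching directories newest first; errors are hidden
    # so that "no match" simply produces empty output.
    newest=$(ls -td "$parent"/GERALD_*_"$user"/ 2>/dev/null | head -n 1)
    if [ -z "$newest" ]; then
        return 1
    fi
    # strip the trailing slash that the glob pattern added
    printf '%s\n' "${newest%/}"
}
```

For example, from the directory where the job was submitted: cd "$(latest_gerald_dir . "$USER")"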


#review the results:
The PBS cluster script creates a folder named "GERALD_today's-date_your-user-id", which contains the alignment results. Open the master summary file "summary.htm" in your browser to get an overview of the results.
 
#to run alignment on your own data:
To run the script on your own data, edit the file "gerald_config_rna-seq.txt" and change the values of "read_folder" and "walltime" in the job script "pbs_gerald_batch_rna-seq.script" accordingly.
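
The walltime change can also be scripted. The following is a minimal sketch under one assumption: that the job script contains a line of the form "#PBS -l walltime=HH:MM:SS". The function name set_walltime is hypothetical. The "read_folder" value is left for manual editing, because its exact format in the job script is not shown here.

```shell
# Hypothetical helper: write a copy of a PBS job script with a new wall clock
# limit. Assumes the script holds a line of the form
# "#PBS -l walltime=HH:MM:SS"; all other lines are copied unchanged.
set_walltime() {
    in_script=$1
    out_script=$2
    new_limit=$3
    # refuse values that are not in two-digit HH:MM:SS form
    case $new_limit in
        [0-9][0-9]:[0-9][0-9]:[0-9][0-9]) ;;
        *) echo "walltime must look like HH:MM:SS" >&2; return 1 ;;
    esac
    sed "s/^#PBS -l walltime=.*/#PBS -l walltime=$new_limit/" "$in_script" > "$out_script"
}
```

For example: set_walltime pbs_gerald_batch_rna-seq.script my_gerald.script 12:00:00, followed by qsub my_gerald.script. Writing to a new file keeps the original job script untouched.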



Run demultiplexer.pl on test data using the Oscar computer cluster (not working yet):


#log in to oscar
ssh oscar

#make a temporary directory for testing
mkdir data/tem_test

#go to the temporary directory
cd data/tem_test

#take a look at the test data set we are going to work on
ls -l /gpfs/runtime/bioinfo/casava1.7_data_script/TestData/Demultiplexer/PE/Bustard1.5.1_11-11-2009_craczy

#copy the config file
cp /gpfs/runtime/bioinfo/casava1.7_data_script/demultiplex_gerald_config.txt .

#copy the PBS job script
cp /gpfs/runtime/bioinfo/casava1.7_data_script/pbs_demultiplex_gerald_batch.script .

#submit the job
qsub pbs_demultiplex_gerald_batch.script

#check job status
showq -w user=$USER


#When the job finishes, it creates a combined output and error file, "demultiplexer_gerald.out". Look at the end of it and make sure there are no errors.
tail -n 30 demultiplexer_gerald.out

#review the results:
The PBS cluster script creates a folder named "demultiplexed", which contains one folder per bar-code. Each bar-code folder contains a folder called "GERALD_today's-date_your-user-id", which holds the alignment results. Open the master summary file "summary.htm" in your browser to get an overview of the results.
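
With many bar-codes it can be tedious to find each summary file by hand. The sketch below lists them all in one pass. The function name list_barcode_summaries is hypothetical, and matching both "summary.htm" and "Summary.htm" is a precaution of this sketch, not something stated above.

```shell
# Hypothetical helper: list every summary file below a "demultiplexed"
# folder, one path per line, sorted so the bar-codes appear in order.
list_barcode_summaries() {
    demux_root=$1
    if [ ! -d "$demux_root" ]; then
        echo "folder not found: $demux_root" >&2
        return 1
    fi
    # [Ss] accepts either capitalisation of the file name
    find "$demux_root" -type f -name '[Ss]ummary.htm' | sort
}
```

For example, from the directory where the job was submitted: list_barcode_summaries demultiplexed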

#to run alignment on your own data:
To run the script on your own data, edit the files "sampleSheet.csv" and "config.template.txt", and change the values of "-input_dir" and "walltime" in the job script "pbs_demultiplex_gerald_batch.script" accordingly.




Post Alignment Analysis Using CASAVA1.7

This section explains the steps needed to process data produced by the CASAVA aligners. The post-alignment analysis tools (green part in the above figure) can be used for detecting SNPs and indels, estimating genomic copy number, and DGE counting.

The CASAVA 1.7 pipeline software provides "run.pl" and "runRNA.pl" to prepare the task files, and "taskServer.pl" to execute the tasks.

Run "run.pl" (or "runRNA.pl") to create the task file for the test data, then run "taskServer.pl" to execute the tasks on the IBM HP cluster:

#log in to oscar
ssh oscar

#make a temporary directory for testing
mkdir data/tem_test

#go to the temporary directory
cd data/tem_test

#take a look at the test data set
ls /gpfs/runtime/opt/casava/1.7.0/share/CASAVA-1.7.0/examples/GERALD

#copy the PBS job script
cp /gpfs/runtime/bioinfo/casava1.7_data_script/pbs_post_align_batch.script .

#submit the job
qsub pbs_post_align_batch.script

#check job status
showq -w user=$USER

#When the job finishes, it creates a combined output and error file, "pbs_post_align.out". Look at the end of it and make sure there are no errors.
tail -n 30 pbs_post_align.out

#review the results:
The PBS cluster script creates a folder named "POST_ALIGN_PE_today's-date_your-user-id" for the default paired-end run, which contains the analysis results.

#to run post-alignment analysis on your own data:
The following is the PBS cluster job submission script. To run it on your own data, change the values of "CASAVA_FEATURES", "CASAVA_EXAMPLES" and "walltime" in the job script "pbs_post_align_batch.script" accordingly:


#!/bin/sh
# '#PBS' is the prefix for PBS directives - see "man qsub" for additional options...
# submit this job with "qsub this_file_name"
# check on the queue with "showq"
# delete queued jobs with "qdel job_name" (use the name found with "showq")

# name the job
#PBS -N pbs_post_align_run
#PBS -r n

# set up output file for stdout and combine stdout and stderr streams
#PBS -o pbs_post_align.out
#PBS -j oe

#get an email notice when the job is done
#PBS -m e
#PBS -M your_email@brown.edu

# request one node (implicitly 8 processors)
#PBS -l nodes=1

# specify a maximum wall clock execution limit - running over will kill job...
#PBS -l walltime=02:00:00

cd $PBS_O_WORKDIR

echo work dir is $PBS_O_WORKDIR

#PBS_O_WORKDIR=.
CASAVA_PATH=/gpfs/runtime/opt/casava/1.7.0/bin
CASAVA_FEATURES=/gpfs/runtime/opt/casava/1.7.0/share/CASAVA-1.7.0/examples/features
CASAVA_EXAMPLES=/gpfs/runtime/opt/casava/1.7.0/share/CASAVA-1.7.0/examples

out_folder=$PBS_O_WORKDIR/POST_ALIGN_PE_$(date '+%d-%m-%Y%n')_$('whoami')

#Run default paired DNA analysis targets on test E_coli data run TestEColiPE lane 4
${CASAVA_PATH}/run.pl --runId=TestEColiPE --projectDir=$out_folder \
-e ${CASAVA_EXAMPLES}/GERALD -l 4 \
--refSequences=${CASAVA_EXAMPLES}/genomes/E_coli --snpCovCutoff=-1 --indelsCovCutoff=-1

#out_folder=$PBS_O_WORKDIR/POST_ALIGN_SE_$(date '+%d-%m-%Y%n')_$('whoami')

#Run default single-ended DNA analysis targets on E_coli data run TestEColiSE lane 4
#${CASAVA_PATH}/run.pl --runId=TestEColiSE --projectDir=$out_folder \
#-e ${CASAVA_EXAMPLES}/GERALD -l 4 \
#--refSequences=${CASAVA_EXAMPLES}/genomes/E_coli --snpCovCutoff=-1 --readMode=single

#out_folder=$PBS_O_WORKDIR/POST_ALIGN_RNA_UHR_$(date '+%d-%m-%Y%n')_$('whoami')

#Run default RNA analysis targets on Human_UHR chromosome 22 data run TestRNAUHR lane 2
#${CASAVA_PATH}/runRNA.pl --runId=TestRNAUHR --projectDir=$out_folder \
#--seqGeneMdFile=${CASAVA_FEATURES}/human/NCBI/Build36.3/seq_gene.md.gz \
#-e ${CASAVA_EXAMPLES}/RNA_UHR_GERALD -l 2 \
#--refSequences ${CASAVA_EXAMPLES}/genomes/human

echo ready to run the task $(ls $out_folder/tasks*)
${CASAVA_PATH}/taskServer.pl --tasksFile=$(ls $out_folder/tasks*) --host=localhost --jobs=8

echo Finished execution at `date`
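
The job script above passes the output of "ls $out_folder/tasks*" straight to taskServer.pl, which goes wrong when run.pl failed and produced no tasks file, or when several files match. The following is a minimal sketch of a guard for that step. The function name find_tasks_file is hypothetical, and the sketch assumes that exactly one file matching tasks* is expected in the project folder.

```shell
# Hypothetical helper: print the single tasks* file in a project folder.
# Returns 1 when there is no such file or more than one.
find_tasks_file() {
    project_dir=$1
    count=0
    found=
    for candidate in "$project_dir"/tasks*; do
        # an unmatched glob stays literal, so check that the file exists
        [ -f "$candidate" ] || continue
        count=$((count + 1))
        found=$candidate
    done
    if [ "$count" -ne 1 ]; then
        echo "expected one tasks file in $project_dir, found $count" >&2
        return 1
    fi
    printf '%s\n' "$found"
}
```

In the job script this could be used as: tasks_file=$(find_tasks_file "$out_folder") && ${CASAVA_PATH}/taskServer.pl --tasksFile="$tasks_file" --host=localhost --jobs=8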

