GATK
GATK is a jar file (GenomeAnalysisTK.jar) that can be invoked using:
module load java gatk
java -jar $GATK ....
There are major differences in the available programs and calling syntax between GATK versions 3 and 4. Treat the Broad Institute guide as the primary documentation, and use the local help output to verify that what the guide describes applies to the installed version.
Primary documentation: GATK from Broad Institute
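As an illustration of the syntax difference, the sketch below prints the CountReads command line in the style of each major version, using the example file names that appear later on this page. The helper name gatk_cmdline is made up for this illustration and is not part of GATK; it only prints the command and does not run it.

```shell
# Hypothetical helper (not part of GATK): print the command line for a tool
# in the calling style of the given major version.
gatk_cmdline() {
  major=$1
  tool=$2
  shift 2
  case "$major" in
    3) echo "java -jar \$GATK -T $tool $*" ;;  # version 3: tool selected with -T
    4) echo "gatk $tool $*" ;;                  # version 4: tool is the first argument to gatk
    *) echo "unsupported major version: $major" >&2; return 1 ;;
  esac
}

gatk_cmdline 3 CountReads -R exampleFASTA.fasta -I exampleBAM.bam
gatk_cmdline 4 CountReads -R exampleFASTA.fasta -I exampleBAM.bam
```

Other tools may differ between versions in more than the calling style, so check the program-specific help before reusing a version 3 command under version 4.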
Typical interactive usage is to request an interactive node:
srun -t 1:00:00 --mem=8gb --pty bash
List the available versions:
module spider gatk
----------------------------------------------------------------
Description:
Variant Discovery in High-Throughput Sequencing Data.
Versions:
gatk/3.8
gatk/4.0.1.1
gatk/4.1.7.0
----------------------------------------------------------------
For detailed information about a specific "gatk" module (including how to load the modules) use the module's full name.
For example:
$ module spider gatk/4.1.7.0
----------------------------------------------------------------
Load the module and run the help command:
module load java gatk/<version>
java -jar $GATK -h
For help on a specific GATK program:
java -jar $GATK <program name> -h
Using Other Java Version
GATK requires Java 8. The default Java on Rider is OpenJDK 8, which is compatible with GATK. If you nevertheless experience issues with that version, try Oracle's Java JDK 8 by loading the corresponding module:
module load java/8u121
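To confirm which Java is active before launching GATK, something like the following can be used. It is only a sketch: it assumes the usual Java 8 banner format, in which java -version reports a version string beginning with 1.8., and the function name is_java8 is made up for this example.

```shell
# Hypothetical helper: succeed if a "java -version" banner line looks like Java 8.
# Assumes the Java 8 convention of reporting the version as "1.8.x".
is_java8() {
  case "$1" in
    *'version "1.8.'*) return 0 ;;
    *) return 1 ;;
  esac
}

if command -v java >/dev/null 2>&1; then
  # java -version writes its banner to stderr, so redirect it to stdout.
  banner=$(java -version 2>&1 | head -n 1)
  if is_java8 "$banner"; then
    echo "Java 8 detected: $banner"
  else
    echo "Not Java 8: $banner"
  fi
else
  echo "java not found on PATH; load a java module first"
fi
```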
Version 4.0.1.1
To use GATK, first we will allocate a compute node. Be sure to choose the necessary resources:
srun -t 2:00:00 -c <cores> --pty /bin/bash
In version 4, GATK is run through the gatk wrapper script, which becomes available after loading the module:
module load gatk/4.0.1.1
and then:
gatk [options]
Interactive Jobs
The following example runs GATK in an interactive session.
Copy the example files to your test directory:
cd $HOME
mkdir -p GATK_TEST
cd GATK_TEST
cp /usr/local/gatk/3.8/resources/* .
Then request a compute node (for this example, the default options are enough):
srun --pty bash
Load the GATK module:
module load gatk/4.0.1.1
Run the help command:
gatk -h
Run the examples:
gatk CountReads -R exampleFASTA.fasta -I exampleBAM.bam
Output:
Using GATK jar /usr/local/gatk/4.0.1.1/gatk-package-4.0.1.1-local.jar
Running:
...
15:08:39.433 INFO ProgressMeter - Traversal complete. Processed 33 total reads in 0.0 minutes.
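Before running a tool on your own data, it can help to check that the inputs and their companion files are in place. The sketch below is one way to do that; it assumes the common naming convention of a .fai index next to the FASTA file and a .dict sequence dictionary with the FASTA extension replaced, and the function name missing_inputs is hypothetical.

```shell
# Hypothetical helper: print every expected input file that does not exist.
# Assumes <ref>.fai and <ref without extension>.dict sit beside the reference.
missing_inputs() {
  ref=$1
  bam=$2
  dict="${ref%.*}.dict"
  for f in "$ref" "$ref.fai" "$dict" "$bam"; do
    [ -e "$f" ] || echo "$f"
  done
}

missing=$(missing_inputs exampleFASTA.fasta exampleBAM.bam)
if [ -z "$missing" ]; then
  echo "All inputs present"
else
  echo "Missing files:"
  echo "$missing"
fi
```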
To produce a pileup file:
gatk Pileup -R exampleFASTA.fasta -I exampleBAM.bam -O output.txt
Output:
...
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -jar /usr/local/gatk/4.0.1.1/gatk-package-4.0.1.1-local.jar Pileup -R exampleFASTA.fasta -I exampleBAM.bam -O output.txt
...
15:10:15.937 INFO ProgressMeter - chr1:97312 0.0 2052 542378.9
15:10:15.937 INFO ProgressMeter - Traversal complete. Processed 2052 total loci in 0.0 minutes.
Parallelization
Some GATK tools can use multiple processors, breaking the task into smaller parts that are analyzed in parallel. To get a speedup from the parallel tools, you must request as many processors as the number of parallel processes you will specify to the GATK tool. For example, to use 4 parallel processes, request 4 processors:
srun -c 4 --mem=8gb --time=30:00 --pty bash
The version 4 release of GATK uses the Apache Spark framework for parallelization, and tools that support parallel processing are suffixed with "Spark" (e.g., CountReads is the serial version and CountReadsSpark is the parallel version). Setting the number of parallel processes for a GATK command requires two additional switches: --spark-runner, which says where to find Spark, and --spark-master, which says how many processors to use. The example below runs the parallel version of CountReads on 4 processors using the local Spark instance.
gatk CountReadsSpark -R exampleFASTA.fasta -I exampleBAM.bam --spark-runner LOCAL --spark-master 'local[4]'
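One way to keep the requested processors and the Spark process count in step is to derive the Spark master string from the Slurm allocation instead of typing the number twice. The sketch below assumes Slurm exports SLURM_CPUS_PER_TASK when -c is given to srun, and falls back to a single process otherwise; the function name spark_master is illustrative only. Run it from the directory holding the example files.

```shell
# Build the Spark master string from the cores Slurm allocated.
# SLURM_CPUS_PER_TASK is only set when -c/--cpus-per-task was requested,
# so default to one local process.
spark_master() {
  echo "local[${SLURM_CPUS_PER_TASK:-1}]"
}

if command -v gatk >/dev/null 2>&1; then
  gatk CountReadsSpark -R exampleFASTA.fasta -I exampleBAM.bam \
    --spark-runner LOCAL --spark-master "$(spark_master)"
else
  # No GATK module loaded: just show what would be passed.
  echo "gatk not on PATH; would use --spark-master $(spark_master)"
fi
```

Quoting the command substitution keeps the shell from treating the square brackets as a filename pattern.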
Version 3.8
To use GATK, first we will allocate a compute node:
srun --time=1:00:00 -c 2 --mem=8gb --pty /bin/bash
GATK is a jar file (GenomeAnalysisTK.jar) that can be invoked using:
module load gatk/3.8
and then:
java -jar $GATK [options]
Interactive Jobs
The following example runs GATK in an interactive session.
Copy the example files to your test directory:
cd $HOME
mkdir -p GATK_TEST
cd GATK_TEST
cp /usr/local/gatk/3.8/resources/* .
Then request a compute node (for this example, the default options are enough):
srun --pty bash
Load the GATK module:
module load gatk/3.8
Run the help command:
java -jar $GATK -h
Run the examples:
java -jar $GATK -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam
Output:
INFO 12:03:02,214 HelpFormatter - ----------------------------------------------------------------------------------
INFO 12:03:02,216 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50
...
INFO 12:03:02,771 CountReads - CountReads counted 33 reads in the traversal
To count the loci:
java -jar $GATK -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam
Output:
INFO 12:06:57,211 HelpFormatter - ----------------------------------------------------------------------------------
INFO 12:06:57,213 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50
...
INFO 12:06:57,684 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 12:06:57,684 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
2052
INFO 12:06:57,772 ProgressMeter - done 2052.0 0.0 s 42.0 s 97.3% 0.0 s 0.0 s
...
Refer to HPC Guide to Genomics & HPC Software Guide for more information.