GATK

GATK is a jar file (GenomeAnalysisTK.jar) that can be invoked using:

module load java gatk

java -jar $GATK ....

The available programs and the command syntax differ substantially between GATK versions 3 and 4. Use the Broad Institute guide as the primary documentation, and use the local help output to verify that the documented options apply to the installed version.

Primary documentation: GATK from Broad Institute 

Typical interactive usage is to first request an interactive node:

srun -t 1:00:00 --mem=8gb --pty bash

List the available versions:

module spider gatk

----------------------------------------------------------------

    Description:

      Variant Discovery in High-Throughput Sequencing Data.


     Versions:

        gatk/3.8

        gatk/4.0.1.1

        gatk/4.1.7.0


----------------------------------------------------------------

  For detailed information about a specific "gatk" module (including how to load the modules) use the module's full name.

  For example:


     $ module spider gatk/4.1.7.0

----------------------------------------------------------------


Load a specific version and run the help command:

module load java gatk/<version>

java -jar $GATK -h

For help on a specific GATK program (in version 3, the program name is passed with -T, as shown in the Version 3.8 section):

java -jar $GATK <program name> -h
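Because the calling syntax differs between versions 3 and 4, the per-program help command differs as well. The following is a minimal sketch based on the invocations shown in the version sections of this page. The helper name gatk_help_cmd is hypothetical, and it only prints the command to type rather than running GATK:

```shell
# Print the help command for a GATK program, given the module version.
# Version 3 is called through the jar with -T; version 4 uses the gatk wrapper.
gatk_help_cmd() {
    case "$1" in
        3.*) printf 'java -jar $GATK -T %s -h\n' "$2" ;;
        4.*) printf 'gatk %s -h\n' "$2" ;;
        *)   echo "unrecognized GATK version: $1" >&2; return 1 ;;
    esac
}

gatk_help_cmd 3.8 CountReads
gatk_help_cmd 4.0.1.1 CountReads
```

The two calls print the version 3 and version 4 forms of the CountReads help command, respectively.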

Using Another Java Version

GATK requires Java 8. The default Java on Rider is OpenJDK 8, which is compatible with GATK. If you experience any issues with that version, try Oracle's JDK 8 by loading the corresponding module:

module load java/8u121
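To confirm which Java is active after loading a module, the version banner can be inspected. This is a small sketch that relies on Java 8 reporting its version as 1.8.x in the output of java -version. The helper name is_java8 is hypothetical:

```shell
# Succeed if the given "java -version" banner belongs to Java 8,
# which reports itself as version "1.8.x".
is_java8() {
    case "$1" in
        *'version "1.8.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# java -version writes its banner to stderr, so redirect it to capture it.
if command -v java >/dev/null 2>&1; then
    banner="$(java -version 2>&1)"
    if is_java8 "$banner"; then
        echo "Java 8 is active"
    else
        echo "Active Java is not version 8:"
        echo "$banner"
    fi
fi
```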

Version 4.0.1.1

To use GATK, first we will allocate a compute node. Be sure to choose the necessary resources:

srun -t 2:00:00 -c <cores> --pty /bin/bash

In version 4, GATK is invoked through the gatk wrapper script, which becomes available after loading the module:

module load gatk/4.0.1.1

and then:

gatk [options]

Interactive Jobs

The following example runs GATK in an interactive session.

Copy the example files to a test directory:

cd $HOME

mkdir -p GATK_TEST

cd GATK_TEST

cp /usr/local/gatk/3.8/resources/* .

Then request a compute node (for this example, the default options are enough):

srun --pty bash

Load the GATK module:

module load gatk/4.0.1.1

Run the help command:

gatk -h

Run the examples:

gatk CountReads -R exampleFASTA.fasta -I exampleBAM.bam

Output:

Using GATK jar /usr/local/gatk/4.0.1.1/gatk-package-4.0.1.1-local.jar

Running:

... 

15:08:39.433 INFO  ProgressMeter - Traversal complete. Processed 33 total reads in 0.0 minutes. 

To produce a pileup file:

gatk Pileup -R exampleFASTA.fasta -I exampleBAM.bam -O output.txt

Output:

...

java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -jar /usr/local/gatk/4.0.1.1/gatk-package-4.0.1.1-local.jar Pileup -R exampleFASTA.fasta -I exampleBAM.bam -O output.txt

...

15:10:15.937 INFO  ProgressMeter -           chr1:97312              0.0                  2052         542378.9

15:10:15.937 INFO  ProgressMeter - Traversal complete. Processed 2052 total loci in 0.0 minutes.

Parallelization

Some GATK tools can use multiple processors, breaking the task into smaller parts that are analyzed in parallel. To achieve a speedup with the parallel tools, you must request a number of processors that matches the number of parallel processes you will specify to the GATK tool. For example, to use 4 parallel processes, request 4 processors:

srun -c 4 --mem=8gb --time=30:00 --pty bash

Version 4 of GATK uses the Apache Spark framework for parallelization, and tools that support parallel processing are suffixed with "Spark" (e.g., CountReads is the serial version and CountReadsSpark is the parallel version). Specifying the number of parallel processes requires two additional switches: one that tells GATK where to run Spark and one that sets how many processors to use. These Spark switches go after a standalone -- separator at the end of the command, and the local[4] value is quoted so that the shell does not treat the brackets as a glob pattern. The example below runs the parallel version of CountReads on 4 processors using the local Spark instance.

gatk CountReadsSpark -R exampleFASTA.fasta -I exampleBAM.bam -- --spark-runner LOCAL --spark-master 'local[4]'
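To keep the Spark thread count in step with the processors actually requested from Slurm, the count can be read from the environment instead of being typed twice. The sketch below assumes the session was started with -c, in which case Slurm sets SLURM_CPUS_PER_TASK. It also assumes the Spark switches are passed after a -- separator, as described in the GATK 4 README. The helper name spark_master is hypothetical:

```shell
# Print a Spark master URL sized to the current Slurm allocation.
# SLURM_CPUS_PER_TASK is set by Slurm when -c/--cpus-per-task is requested;
# without it, fall back to a single thread.
spark_master() {
    printf 'local[%s]\n' "${SLURM_CPUS_PER_TASK:-1}"
}

# Run the parallel read counter only where GATK and the example BAM are present.
if command -v gatk >/dev/null 2>&1 && [ -f exampleBAM.bam ]; then
    gatk CountReadsSpark -I exampleBAM.bam \
        -- --spark-runner LOCAL --spark-master "$(spark_master)"
fi
```

With this approach, changing the -c value on the srun line is enough to change the number of Spark threads.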

Version 3.8

To use GATK, first we will allocate a compute node:

srun --time=1:00:00 -c 2 --mem=8gb --pty /bin/bash

GATK is a jar file (GenomeAnalysisTK.jar) that can be invoked using:

module load gatk/3.8

and then:

java -jar $GATK [options]

Interactive Jobs

The following example runs GATK in an interactive session.

Copy the example files to a test directory:

cd $HOME

mkdir -p GATK_TEST

cd GATK_TEST

cp /usr/local/gatk/3.8/resources/* .

Then request a compute node (for this example, the default options are enough):

srun --pty bash

Load the GATK module:

module load gatk/3.8

Run the help command:

java -jar $GATK -h

Run the examples:

java -jar $GATK -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam

Output:

INFO 12:03:02,214 HelpFormatter - ---------------------------------------------------------------------------------- 

INFO 12:03:02,216 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50 

... 

INFO 12:03:02,771 CountReads - CountReads counted 33 reads in the traversal 

To count the loci:

java -jar $GATK -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam

Output:

INFO 12:06:57,211 HelpFormatter - ---------------------------------------------------------------------------------- 

INFO 12:06:57,213 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50

...

INFO 12:06:57,684 ProgressMeter - | processed | time | per 1M | | total | remaining 

INFO 12:06:57,684 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime 

2052 

INFO 12:06:57,772 ProgressMeter - done 2052.0 0.0 s 42.0 s 97.3% 0.0 s 0.0 s

...

Refer to HPC Guide to Genomics & HPC Software Guide for more information.