Created by Geraldine_VdAuwera on 2014-10-02
Overview
This document describes how GATK commands are structured and how to add arguments to basic command examples.
Commands for GATK always follow the same basic syntax:
java [Java arguments] -jar GenomeAnalysisTK.jar [GATK arguments]
The core of the command is java -jar GenomeAnalysisTK.jar, which starts up the GATK program in a Java Virtual Machine (JVM). Any additional Java-specific arguments (such as -Xmx to increase memory allocation) should be inserted between java and -jar, like this:
java -Xmx4G -jar GenomeAnalysisTK.jar [GATK arguments]
The order of arguments between java and -jar is not important.
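As a minimal sketch of that placement, the shell snippet below keeps the JVM options in a variable and prints the resulting command prefix instead of running it. The names JAVA_OPTS, GATK_JAR, and gatk_prefix, as well as the -Djava.io.tmpdir value, are illustrative assumptions rather than GATK conventions.

```shell
# Hypothetical variable names; adjust paths and values to your setup.
GATK_JAR="GenomeAnalysisTK.jar"

# JVM options go between `java` and `-jar`; their relative order does not matter.
JAVA_OPTS="-Xmx4G -Djava.io.tmpdir=/tmp"

# Print the command prefix rather than executing it.
gatk_prefix() {
    echo "java $JAVA_OPTS -jar $GATK_JAR"
}

gatk_prefix
```

GATK arguments such as -T and -R would then be appended after the printed prefix.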
Two universal arguments are required for every GATK command (with very few exceptions, namely the clp-type utilities): -R for Reference (e.g. -R human_b37.fasta) and -T for Tool name (e.g. -T HaplotypeCaller).
Additional arguments fall in two categories:
Engine arguments, like -L (for specifying a list of intervals), which can be given to all tools and are technically optional but may be effectively required at certain steps for specific analytical designs (e.g. the -L argument for calling variants on exomes);
Tool-specific arguments, which may be required, like -I (to provide an input file containing sequence reads to tools that process BAM files), or optional, like -alleles (to provide a list of known alleles for genotyping).
The ordering of GATK arguments is not important, but we recommend always passing the tool name (-T) and reference (-R) first for consistency. It is also a good idea to order arguments consistently by some kind of logic, so that different commands are easy to compare over the course of a project. It’s up to you to choose what that logic should be.
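One way to enforce a consistent ordering is to route every command through a small wrapper. The sketch below assumes a hypothetical shell function, gatk_cmd, with a fixed -Xmx4G setting; it only prints the command so the ordering can be inspected.

```shell
# gatk_cmd TOOL REFERENCE [other GATK arguments...]
# Prints a GATK command line with -T and -R always in first position.
# The function name and the fixed -Xmx4G setting are assumptions for this sketch.
gatk_cmd() {
    tool=$1
    reference=$2
    shift 2
    # "$*" joins the remaining arguments with single spaces.
    echo "java -Xmx4G -jar GenomeAnalysisTK.jar -T $tool -R $reference $*"
}

gatk_cmd HaplotypeCaller human_b37.fasta -I sample1.bam -o raw_variants.vcf
```

Because the tool and reference are positional parameters of the wrapper, they cannot drift to different places in different commands.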
All available engine and tool-specific arguments are listed in the tool documentation section. Arguments typically have both a long name (prefixed by --) and a short name (prefixed by -). The GATK command line parser recognizes both equally, so you can use whichever you prefer, depending on whether you want your commands to be more verbose or more succinct.
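To illustrate, the snippet below writes the same command twice, once with short names and once with long names. The long names shown (--analysis_type, --reference_sequence, --input_file, --out) are taken to be the GATK 3 equivalents of -T, -R, -I and -o; check them against the tool documentation for your version. The variable names are hypothetical, and the commands are printed, not run.

```shell
# Short-name form, as used in the examples in this document.
short_cmd="java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37.fasta -I sample1.bam -o raw_variants.vcf"

# Long-name form of the same command (long names assumed from the GATK 3 docs).
long_cmd="java -jar GenomeAnalysisTK.jar --analysis_type HaplotypeCaller --reference_sequence human_b37.fasta --input_file sample1.bam --out raw_variants.vcf"

echo "$short_cmd"
echo "$long_cmd"
```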
Finally, a note about flags. Flags are arguments that have boolean values, i.e. TRUE or FALSE. They are typically used to enable or disable specific features; for example, --keep_program_records will make certain GATK tools output additional information in the BAM header that would be omitted otherwise. In GATK, all flags are set to FALSE by default, so if you want to set one to TRUE, all you need to do is add the flag name to the command. You don't need to specify an actual value.
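A short sketch of the difference, using PrintReads and made-up file names purely for illustration (the choice of tool and the variable names are assumptions, and nothing is executed):

```shell
# Base command; PrintReads is used here only as an example of a tool that writes a BAM file.
base="java -jar GenomeAnalysisTK.jar -T PrintReads -R human_b37.fasta -I sample1.bam -o output.bam"

# Flag left at its default (FALSE): simply omit it.
flag_off="$base"

# Flag set to TRUE: add the bare flag name, with no value after it.
flag_on="$base --keep_program_records"

echo "$flag_off"
echo "$flag_on"
```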
The following very simple command runs HaplotypeCaller in default mode on a single input BAM file containing sequence data and outputs a VCF file containing raw variants:
java -Xmx4G -jar GenomeAnalysisTK.jar -R human_b37.fasta -T HaplotypeCaller -I sample1.bam -o raw_variants.vcf
If the data is from exome sequencing, we should additionally provide the exome targets using the -L argument:
java -Xmx4G -jar GenomeAnalysisTK.jar -R human_b37.fasta -T HaplotypeCaller -I sample1.bam -o raw_variants.vcf -L exome_intervals.list
If we just want to genotype specific sites of interest using known alleles based on results from a previous study, we can change HaplotypeCaller’s genotyping mode using -gt_mode, provide those alleles using -alleles, and restrict the analysis to just those sites using -L:
java -Xmx4G -jar GenomeAnalysisTK.jar -R human_b37.fasta -T HaplotypeCaller -I sample1.bam -o raw_variants.vcf -L known_alleles.vcf -alleles known_alleles.vcf -gt_mode GENOTYPE_GIVEN_ALLELES
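The three commands above differ only in the arguments appended to a common base, which a script can make explicit. The following is a sketch under that reading; the variable names BASE, EXOME_ARGS, and GGA_ARGS are hypothetical, and the commands are printed for inspection instead of being run.

```shell
# Part shared by all three example commands.
BASE="java -Xmx4G -jar GenomeAnalysisTK.jar -R human_b37.fasta -T HaplotypeCaller -I sample1.bam -o raw_variants.vcf"

# Extra arguments for each scenario, appended to BASE.
EXOME_ARGS="-L exome_intervals.list"
GGA_ARGS="-L known_alleles.vcf -alleles known_alleles.vcf -gt_mode GENOTYPE_GIVEN_ALLELES"

echo "$BASE"              # default mode
echo "$BASE $EXOME_ARGS"  # exome data
echo "$BASE $GGA_ARGS"    # genotyping given alleles
```

Keeping the shared part in one place makes it easier to see exactly which arguments distinguish one run from another.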
For more examples of commands and for specific tool commands, see the tool documentation section.
Updated on 2014-10-02