created by Geraldine_VdAuwera
on 2018-01-07
GATK, pronounced "Gee Ay Tee Kay" (not "Gat-Kay"), stands for GenomeAnalysisToolkit. It is a collection of command-line tools for analyzing high-throughput sequencing data with a primary focus on variant discovery. The tools can be used individually or chained together into complete workflows. We provide end-to-end workflows, called GATK Best Practices, tailored for specific use cases.
Starting with version 4.0, GATK contains a copy of the Picard toolkit, so all Picard tools are available from within GATK itself and their documentation is available in the Tool Documentation section of this website.
Contents
0. Preview the pipelines
If you don't yet know for sure you're actually going to use GATK for your work, here's a tip for test-driving the software without having to do any real work yourself.
We're using a cloud platform called Terra to make it easier to get started with GATK. We've set up all our Best Practices pipelines in preconfigured workspaces so you can poke at them, see how they work and examine the results they produce on example data. You can also upload your own data (privately and securely) to test how the pipelines perform on that.
We're complementing this with new tutorials that use Jupyter Notebooks, also in Terra, to walk you through the logic, operation and results of each step of the pipeline. We've already been using this approach in our popular workshop series with very encouraging results, and going forward we're planning to offer all of our tutorials as Jupyter Notebooks.
You can read more about how and why you can get started with GATK on Terra in this series of blog posts:
If you end up liking it, you can even adopt Terra to do all your work, but we don't expect it to be a fit for everyone. This just feels like the best way we can empower you to try out our tools and test new releases without having to put in a ton of effort up front.
Use the gatk wrapper script rather than calling either jar directly; full details here. The basic syntax is:
gatk [--java-options "-Xmx4G"] ToolName [GATK args]
Most GATK4 tools have fairly simple software requirements: a Unix-style OS and Java 1.8. However, a subset of tools have additional R and/or Python dependencies. These dependencies (as well as the base system requirements) are described in detail here. So we strongly recommend using the Docker container system, if that's an option on your infrastructure, rather than a custom installation. All released versions of GATK4 can be found as prepackaged container images on Dockerhub here. If you can't use Docker, do yourself a favor and use the Conda environment that we provide to manage dependencies, as described in the github repository README. If you run into a pip error and also recently updated your Mac OS, then see this solution.
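If you go the Docker route, pulling a prepackaged image from Dockerhub and opening a shell inside it is usually all that's needed. A minimal sketch (the version tag shown is only an example; substitute whichever release you need):
docker pull broadinstitute/gatk:4.1.3.0
docker run -it broadinstitute/gatk:4.1.3.0
Inside the container, the gatk wrapper script and all dependencies are already set up. If you use the Conda route instead, the environment file shipped with the package is created and activated as described in the repository README.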
You will also need Python 2.6 or greater to run the gatk wrapper script (described below).
If you run into difficulties with the Java version requirement, see this article for help.
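A quick way to confirm which Java version is active is to check it from the same shell you will use to run GATK; the output should report a 1.8.x version (Java 8):
java -version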
You can download the GATK package here OR get the Docker image here. The instructions below will assume you downloaded the GATK package to your local machine and are planning to run it directly. For instructions on how to go the Docker route, see this tutorial.
Once you have downloaded and unzipped the package (named gatk-[version]), you will find four files inside the resulting directory:
gatk
gatk-package-[version]-local.jar
gatk-package-[version]-spark.jar
README.md
Now you may ask, why are there two jars? As the names suggest, gatk-package-[version]-spark.jar is the jar for running Spark tools on a Spark cluster, while gatk-package-[version]-local.jar is the jar that is used for everything else (including running Spark tools "locally", i.e. on a regular server or cluster).
So does that mean you have to specify which one you want to run each time? Nope! See the gatk file in there? That's an executable wrapper script that you invoke and that will choose the appropriate jar for you based on the rest of your command line. You could still invoke a specific jar if you wanted, but using gatk is easier, and it will also take care of setting some parameters that you would otherwise have to specify manually.
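For instance, a Spark-capable tool can be run locally through the wrapper just like any other tool, with no cluster involved (tool and file names below are only illustrative):
gatk MarkDuplicatesSpark -I input.bam -O marked_duplicates.bam
The wrapper selects the right jar for you, so the command looks the same whether or not the tool uses Spark under the hood.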
There is no installation necessary in the traditional sense, since the precompiled jar files should work on any POSIX platform that satisfies the requirements listed above. You'll simply need to open the downloaded package and place the folder containing the jar files and launch script in a convenient directory on your hard drive (or server filesystem). Although the jars themselves cannot simply be added to your PATH, you can add the directory containing the gatk wrapper script. Please look up instructions for the terminal shell you use; in bash the typical syntax is export PATH=$PATH:/path/to/gatk-package where /path/to/gatk-package is the directory containing the gatk executable. Note that the jars must remain in the same directory as gatk for it to work.
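To make that change persist across sessions, you could append the export line to your shell startup file and then verify that the wrapper is found; a minimal sketch for bash (the install path shown is illustrative):
echo 'export PATH=$PATH:/opt/gatk-4.1.3.0' >> ~/.bashrc
source ~/.bashrc
which gatk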
To test that you can successfully invoke the GATK, run the following command in your terminal application. Here we assume that you have added gatk to your PATH as recommended above:
gatk --help
This should output a summary of the invocation syntax, options for listing tools and invoking a specific tool's help documentation, and main Spark options if applicable.
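For example, to list the available tools or pull up a specific tool's own help text, you can run:
gatk --list
gatk HaplotypeCaller --help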
Available tools are listed and described in some detail in the Tool Documentation section, along with available options. The basic syntax for invoking any GATK or Picard tool is the following:
gatk [--java-options "jvm args like -Xmx4G go here"] ToolName [GATK args go here]
So for example, a simple GATK command would look like:
gatk --java-options "-Xmx8G" HaplotypeCaller -R reference.fasta -I input.bam -O output.vcf
You can find more information about GATK command-line syntax here.
Syntax for Picard tools
When used from within GATK, all Picard tools use the same syntax as GATK. The conversion relative to the "Picard-style" syntax is very straightforward; wherever you used to do e.g. I=input.bam, you now do -I input.bam. So for example, a simple Picard command would look like:
gatk ValidateSamFile -I input.bam -MODE SUMMARY
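For comparison, the equivalent command run against a standalone Picard jar would use the older Picard-style syntax (the jar path shown is illustrative):
java -jar picard.jar ValidateSamFile I=input.bam MODE=SUMMARY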
The GATK Best Practices are end-to-end workflows that are meant to provide step-by-step recommendations for performing variant discovery analysis in high-throughput sequencing (HTS) data. We have several such workflows tailored to project aims (by type of variants of interest) and experimental designs (by type of sequencing approach). And although they were originally designed for human genome research, the GATK Best Practices can be adapted for analysis of non-human organisms of all kinds, including non-diploids.
The documentation for the Best Practices includes high-level descriptions of the processes involved, various types of documents that explain deeper details and adaptations that can be made depending on constraints and use cases, a set of actual pipeline implementations of these recommendations, and, perhaps most important, workshop materials including slide decks, videos and tutorials that walk you through every step.
Most of the work involved in processing sequence data and performing variant discovery can be automated in the form of pipeline scripts, which often include some form of parallelization to speed up execution. We provide scripted implementations of the GATK Best Practices workflows plus some additional helper/accessory scripts in order to make it easier for everyone to run these sometimes rather complex workflows.
These workflows are written in WDL and intended to be run on any platform that supports WDL execution. Options are listed in the Pipelining section of the User Guide. Our preferred option is the Cromwell execution engine, which, like GATK, is developed by the Broad's Data Sciences Platform (DSP) and is available as a service on our cloud platform, Terra (formerly known as FireCloud).
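As a rough sketch of what running one of these WDL workflows locally with Cromwell looks like (the jar and file names are illustrative; an inputs JSON is typically provided alongside each pipeline):
java -jar cromwell-<version>.jar run pipeline.wdl --inputs pipeline.inputs.json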
If you choose to run GATK workflows through Terra, you don't really need to do any of the above, since everything is already preloaded in a ready-to-run form (the software, the scripts, even some example data). At this point Terra is the easiest way to run the workflows exactly as we do in our own work. As noted above, we've set up all our Best Practices pipelines in preconfigured Terra workspaces, so you can try them out without having to do any setup. You can compare the results to other pipelines and also upload your own data (privately and securely) to test how our pipelines perform on it. Read this blog post to learn more about this resource.
We provide all support through our very active community forum. You can ask questions and report any problems that you might encounter, with the following guidelines:
Before posting to the Forum, please do the following:
When asking a question about a problem, please include the following:
We will typically get back to you with a response within one or two business days, but be aware that more complex issues (or unclear reports) may take longer to address. In addition, some times of the year are especially busy for us and we may take longer than usual to answer your question.
We may ask you to submit a formal bug report, which involves sending us some test data that we can use to reproduce the problem ourselves. This is often required for debugging. Rest assured we treat all data transferred to us as private and confidential. In some cases we may ask for your permission to include a snippet of your test case in our testing framework, which is publicly accessible. In such a case, YOU are responsible for verifying with whoever owns the data whether you are authorized to allow us to make that data public.
Note that the information in this documentation guide is targeted at end-users. For developers, the source code and related resources are available on GitHub.
Consider subscribing to forum notifications and announcements so you'll get an email when we answer your questions and when we post new content to the blog, which is the best way to stay informed of new features and opportunities. For instructions, see https://software.broadinstitute.org/gatk/documentation/article?id=11026.
Updated on 2019-07-14
From stachyra on 2019-06-18
Under “1.) Quick start for the impatient / Make sure you have Java 8 / JDK 1.8 (Oracle or OpenJDK, doesn’t matter)” I suggest adding three sub-bullets.
First, the impatient will tend to misread this bullet as “JDK 8 or greater”, as that’s usually how software dependency compatibility works in the large majority of cases. But if it really has to be JDK 8 exactly (i.e., both earlier and later versions are incompatible), then it may be helpful to call attention to this fact more explicitly. Given that the most recent version from Oracle is JDK 12, this is what most users will tend to have installed on their system by default, and it’s what they’ll tend to try first, resulting in errors and confusion.
For the second bullet, maybe add a one-line suggestion for how the user can get his or her Java installation to fall back to an earlier version, even though a newer version may be the default? In bash on MacOS with OpenJDK, I found myself using “export JAVA_HOME=`/usr/libexec/java_home -v 1.8`”; however, I had to search around for quite a while in order to arrive at this solution, as it’s not something I’ve ordinarily needed to do in the past (most apps tend to be compatible with the most recent version of Java).
Finally, for some non-Linux systems, OpenJDK implementations of Java 8 are a bit of a hassle to find. For MacOS, I ended up downloading from AdoptOpenJDK, but when you search “OpenJDK” on Google, AdoptOpenJDK is actually the 5th or 6th hit down the list, and it’s not obvious that’s the right place to go. Given that Java 8 is a critical dependency for the tool, it would be helpful to know Broad’s recommendation on where to obtain it, for all operating systems.
From SkyWarrior on 2019-06-20
AFAIK the recommendation is to use the Docker image, which already contains all the dependencies along with the Python environment.