created by shlee
on 2019-01-31
This guide introduces select elements of the broadinstitute/gatk GitHub repository to researchers on the GATK forum whom we have pointed to the repo, for any of a variety of reasons, and who are unfamiliar with GitHub.
The labels in the screenshot number the seven elements this article covers.
Understanding the first three elements (Sections 1–3) should enable researchers to (i) interpret, for example, the status of a feature request or bug fix for a particular GATK release version and (ii) be involved in the discussion that drives GATK development forward.
The remaining four elements (Sections 4–7) are of interest to those who wish to read about the mathematics behind GATK algorithms, view versioned WDL-format pipelines for workflows under recent development, learn how to use engine features, e.g. streaming from Google Cloud Storage, and build GATK from the source code.
docs: Mathematical whitepapers on select algorithms
scripts: Tested versioned WDL pipeline scripts
README.md: Instructions to build and run GATK in the required environment
Issue tickets are where discussion happens and where plans are set to make changes to the codebase.
A ticket that discusses plans, or that has a Closed status, does not necessarily mean that GATK has implemented or will implement what is discussed within. Skim the discussion and look for associated pull requests, often referred to as PRs, and their status (screenshot below). If you are unclear on any point, ask for clarification by writing a comment in the issue ticket. To do so, you will need a GitHub account and must be signed in.
Here's an example issue ticket where the community drove the implementation of a feature, specifically the --include-non-variant-sites option of GenotypeGVCFs: https://github.com/broadinstitute/gatk/issues/2865.
Read the discussion in the pull request and any associated issue ticket for specifics on the changes.
Here's an example pull request that pairs with the previous example issue ticket: https://github.com/broadinstitute/gatk/pull/5219.
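For readers who prefer to check status from a script rather than the web page, the following is a minimal sketch, not part of the repository's own tooling. It assumes the public GitHub REST API, where an issue record carries a state field and a pull request record additionally carries a merged flag. The function names summarize and fetch are hypothetical, and fetch needs network access.

```python
import json
import urllib.request

# Assumption: the public GitHub REST API root for this repository.
API_ROOT = "https://api.github.com/repos/broadinstitute/gatk"


def summarize(payload: dict) -> str:
    """Reduce an issue or pull request JSON record to a one-line status."""
    # Pull request records carry a "merged" flag; issue records that are
    # really pull requests carry a "pull_request" key.
    is_pr = "merged" in payload or "pull_request" in payload
    kind = "PR" if is_pr else "issue"
    state = payload.get("state", "unknown")
    if payload.get("merged"):
        # A merged pull request is reported as closed, so say "merged" instead.
        state = "merged"
    return f"{kind} #{payload.get('number')}: {state}"


def fetch(kind: str, number: int) -> dict:
    """Fetch one record; kind is 'issues' or 'pulls'. Requires network access."""
    with urllib.request.urlopen(f"{API_ROOT}/{kind}/{number}") as response:
        return json.load(response)
```

With network access, summarize(fetch("issues", 2865)) and summarize(fetch("pulls", 5219)) would report on the two examples above. A closed issue paired with a merged pull request is a stronger hint that a change made it into the codebase than a Closed status alone.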
In the overview screenshot we see 35 releases for GATK4. The releases page presents releases in reverse-chronological order, so the latest release is at the top.
Typing /path/to/gatk-4.x.x.x/gatk --list into a terminal prompt will list the available tools in the toolkit as well as their production status: experimental (EXPERIMENTAL Tool), in beta testing (BETA Tool), or fit for production (no label).
The branch is set to master by default, which reflects the latest development to the broadinstitute/gatk codebase. To view a snapshot of the code for a particular version of GATK, click the Branch button, then switch to the Tags tab. Selecting a tag version, e.g. 4.0.0.0, will allow you to travel back in time to the codebase as it looked for that particular release. This is useful, for example, if you are looking for WDL pipeline scripts that work with past versions of GATK4 (see Section 6).
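The production-status labels that gatk --list prints can also be collected programmatically. The sketch below is illustrative only. It assumes each labeled tool occupies one line of the listing, with the tool name as the first whitespace-separated field and the label text EXPERIMENTAL Tool or BETA Tool somewhere on the same line. The function name labeled_tools and the sample listing, including its tool names, are made up for the example and are not real gatk output.

```python
# Label strings as they are described in this article.
LABELS = ("EXPERIMENTAL Tool", "BETA Tool")


def labeled_tools(listing: str) -> dict:
    """Map each status label to the names of the tools that carry it."""
    found = {label: [] for label in LABELS}
    for line in listing.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        for label in LABELS:
            if label in line:
                # Assumption: the tool name is the first field on the line.
                found[label].append(fields[0])
    return found


# Hypothetical listing, only to show the expected shape of the input.
SAMPLE = """\
    HypotheticalToolA    (BETA Tool) Made-up description
    HypotheticalToolB    (EXPERIMENTAL Tool) Made-up description
    HypotheticalToolC    Made-up description
"""

print(labeled_tools(SAMPLE))
```

Lines without a label are ignored rather than counted as production tools, because a real listing may also contain headings that this sketch cannot tell apart from unlabeled tools. To run it on a real listing, the text could be captured with Python's subprocess.run using capture_output=True and text=True, checking both stdout and stderr.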
The PDFs within this folder and subfolders outline the mathematics behind select GATK algorithms. If the GATK forum seems sparse on mathematical details, that is because it is not set up to display complex LaTeX equations. The whitepapers are provided by the generosity of GATK methods developers. Be sure to take into consideration the datestamps associated with the articles, as development takes priority over documentation and the mathematical details can fall behind the latest algorithmic improvements.
For certain GATK4 workflows, the developers maintain working WDL pipeline scripts for every release. See Section 4 for instructions on accessing tagged versioned scripts.
Take for example the mutect2_wdl directory. It contains pipeline scripts for creating a Mutect2 PoN, for running Mutect2 on a tumor-normal pair, etc. The view will show the development or master codebase by default. The following portions of the highlighted script illustrate a difference between the v4.0.0.0 and the v4.1.0.0 WDLs in each workflow's invocation of its M2 task.
Notice the URL elements that differ: the tag version and the highlighted lines. We see the latter pipeline defines a number of additional parameters, e.g. artifact_prior_table, that are not present in the earlier pipeline. If we check the details of the respective M2 tasks, then we also see differences. So if you are testing out workflows using broadinstitute/gatk repository WDL scripts, be sure to match the script version to the version of the toolkit.
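The two URL elements noted above, the tag and the highlighted line range, follow GitHub's general pattern for linking to a file at a given tag. As a rough sketch, the helper below assembles such links. The function name script_url is hypothetical, and the file path scripts/mutect2_wdl/mutect2.wdl is an assumption about where the script sits, so adjust it to the file you are actually viewing.

```python
REPO_URL = "https://github.com/broadinstitute/gatk"


def script_url(tag, path, first_line=None, last_line=None):
    """Build a link to a file as it looked at a given tag, optionally with highlighted lines."""
    url = f"{REPO_URL}/blob/{tag}/{path}"
    if first_line is not None:
        url += f"#L{first_line}"
        if last_line is not None and last_line != first_line:
            url += f"-L{last_line}"
    return url


# Assumed location of the Mutect2 workflow script within the repository.
WDL_PATH = "scripts/mutect2_wdl/mutect2.wdl"

for tag in ("4.0.0.0", "4.1.0.0"):
    print(script_url(tag, WDL_PATH))
```

Opening the two printed links side by side, or downloading both files and comparing them with a diff tool, shows which parameters a workflow gained or lost between releases.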
The README.md is a document that the repository landing page displays below the list of folders and files. For the broadinstitute/gatk repository, it presents a wealth of information, organized by a Table of Contents at the top.
Of interest to researchers are the following sections.
Updated on 2019-02-01