Are there any Broad-specific instructions for using GATK?

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by GATK_Team

on 2017-12-29

In general you should use FireCloud, which has all the major GATK workflows preloaded, is more scalable and makes it easier to share any work you do with external collaborators, since the portal is publicly accessible and you can grant anyone access to workspaces securely and conveniently.

However, there are a couple of Broad-internal resources that you can use if FireCloud is not yet a suitable option for you.

    1. Dotkits for running GATK CNV and ACNV
    2. GATK CNV Toolchain in Firehose

1. Dotkits for running GATK CNV and ACNV

The following dotkits should load all the necessary dependencies:

    use .hdfview-2.9
    use Java-1.8
    use .r-3.1.3-gatk-only

If these don't work, move to a VM where the dotkits are not broken. If that still doesn't work, go to FireCloud.
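To tell whether the dotkits loaded correctly, one option is to check that the tools they provide are visible on your PATH. The following is a minimal sketch, not an official check: the executable names `java` and `Rscript` are assumptions about what the Java and R dotkits expose, and `missing_tools` is a hypothetical helper name.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # Assumed executable names; adjust to whatever the dotkits actually provide.
    missing = missing_tools(["java", "Rscript"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All expected tools were found on PATH.")
```

If anything is reported as missing, the dotkits are probably broken in your environment, and the fallbacks described above apply.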

2. GATK CNV Toolchain in Firehose

We make this available as a courtesy, but we will not be able to provide support for any Firehose-specific aspects. Note that Firehose will be phased out at some point in 2018, and you will need to move your work to FireCloud by then. Rest assured we will provide support for the migration (phase-out calendar TBD).

We have put the GATK4 Somatic CNV Toolchain into Firehose. Please copy the workflows below from Algorithm_Commons:

    • GATK_Somatic_CNV_Toolchain_Capture
    • GATK_Somatic_CNV_Toolchain_WGS

Frequently asked questions:

Who do I contact with an issue?

First, make sure that your question is not here or in another forum post. If it is a Firehose issue or you are not sure, email pipeline-help@broadinstitute.org. If you are sure that it is an issue with GATK CNV, ACNV, or GetBayesianHetPulldown, post to the forum.

What is GATK CNV vs. ACNV and which are run in the workflows above?

    • GATK CNV estimates total copy ratio and performs segmentation and (basic) event calling. This tool works very similarly to ReCapSeg (for now).
    • GATK ACNV creates credible intervals for copy ratio and minor allelic fraction (MAF). Under the hood, this tool is very different from Allelic CapSeg, but it can produce a file that can be ingested by ABSOLUTE (i.e. the file is in the same format as the one produced by Allelic CapSeg).
    • Both GATK CNV and ACNV are in the workflows above.

Are the results (e.g. sensitivity and precision) better than ReCapSeg in the GATK CNV toolchain?

If you are running without the allelic integration, the results are equivalent. If you want more details, ask in the forum or invite us to talk to you -- we have a presentation or two about this topic.

Do I run these workflows on Pair Sets or Individual Sets?

Individual Sets

What entity types do the tasks run on?

Samples and Pairs. I realize that the above question says to run the workflow on Individual Sets. This is to work around a Firehose issue.

What are the caveats around WGS?

    • The total copy number tasks (similar to ReCapSeg) take about a tenth of the time that ReCapSeg takes, assuming good NFS performance. This is a good thing.
    • The allelic tasks (GetBayesianPulldown and Allelic CNV) take a very long time to run. Over a day of runtime is not uncommon. In the next version of the GATK4 CNV Toolchain, we will have addressed this issue, but due to dispatch limitations, Firehose may not be able to fully capitalize on these improvements.
    • Runtimes in general are very sensitive to filesystem performance.
    • The results still have the same oversegmentation issues that you will see in ReCapSeg. There is a GC correction tool, but this has not been integrated into the Firehose workflow.
    • There is a bug in a third-party library that limits the size of a PoN. This is unlikely to be an issue for capture, but can become a problem for WGS. For more details, please see gatkforums.broadinstitute.org/gatk/discussion/7594/limits-on-the-size-of-a-pon

What is the future of ReCapSeg?

We are phasing out ReCapSeg, for many reasons, everywhere -- not just Firehose. If you would like more details, post to the forum and we'll respond.

What is the future of Allelic CapSeg?

We have never supported (and never will support) Allelic CapSeg and cannot answer that question. We have some results comparing Allelic CapSeg and GATK ACNV. We can show you if you are interested (internal to Broad only).

Why are there fewer plots than in ReCapSeg?

We did not include plots that we did not believe were being used. If you would like to include additional plots, please post to the forum.

How is the GATK 4 CNV toolchain workflow better than the ReCapSeg workflow?

    • Faster. On exome, ReCapSeg takes ~105 minutes per case sample. GATK CNV takes < 30 minutes. Both estimates assume good NFS filesystem performance.
    • The workflows above include allelic integration results, from the tool GATK ACNV. These results are analogous to what Allelic CapSeg produces.
    • The workflows above produce results compatible with ABSOLUTE and TITAN, i.e. the results can be used as input to either tool.
    • All future improvements and bugfixes are going into GATK, not ReCapSeg. And many improvements are coming....
    • The workflows produce germline heterozygous SNP call files.
    • The ReCapSeg WGS workflow no longer works.
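To put the exome runtime figures above in perspective, here is a small arithmetic sketch. It uses only the two numbers quoted in the list (~105 minutes and < 30 minutes); because 30 minutes is an upper bound for GATK CNV, the results are lower bounds. The function names are hypothetical, and the cohort estimate assumes samples are processed one after another, which may not match how your jobs are dispatched.

```python
RECAPSEG_MINUTES = 105      # ~105 minutes per case sample on exome
GATK_CNV_MAX_MINUTES = 30   # GATK CNV takes < 30 minutes, so 30 is an upper bound


def minimum_speedup(old_minutes=RECAPSEG_MINUTES, new_minutes=GATK_CNV_MAX_MINUTES):
    """Lower bound on the speedup factor, given an upper bound on the new runtime."""
    return old_minutes / new_minutes


def minimum_hours_saved(n_samples, old_minutes=RECAPSEG_MINUTES,
                        new_minutes=GATK_CNV_MAX_MINUTES):
    """Lower bound on hours saved when samples run serially (an assumption)."""
    return n_samples * (old_minutes - new_minutes) / 60


print(minimum_speedup())         # 3.5
print(minimum_hours_saved(100))  # 125.0
```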

Are there new PoNs for these workflows?

Yes, but the PoN locations are already populated if you run the workflows properly. You should not need to do any setup yourself.

Is the correct PoN automatically selected for ICE vs. Agilent samples?

Yes, if you run the workflow as provided.

Is there a PoN creation workflow in Firehose?

No. Never going to happen. Don't ask. See the forum for instructions to create PoNs.

Can I run ABSOLUTE from the output of GATK ACNV?

Yes. The annotations are gatk4cnv_acnv_acs_seg_file_capture (capture) and gatk4cnv_acnv_acs_seg_file_wgs (WGS).

Can I run TITAN from the output of GATK ACNV?

Yes, though there has been little testing done on this. The annotations are gatk4cnv_acnv_acs_seg_file_capture and gatk4cnv_acnv_acs_seg_file_wgs.

Do the workflows above include Oncotator gene lists?

Yes.

These workflows include Picard Target Mapper. Isn't that going to cause me to have to rerun all of my jobs (e.g. MuTect)?

The workflows above will rerun Picard Target Mapper, but only new annotations are added. All previous output annotations of Picard Target Mapper should be populated with the same values. Mutation calling (MuTect) and other downstream tasks will appear outdated, but rerunning them will be job-avoided.

Can I do the tumor-only GATK ACNV workflow?

For exome, the tumor-only workflow is working well, but it is not available in Firehose. If you would like to see evaluation data for tumor-only on exome, we can show you (internal to Broad only).

What are all of the annotations produced?

Where applicable, each annotation listed below also has a *_wgs counterpart.

Sample annotations:

    • gatk4cnv_seg_file_capture -- seg file of GATK CNV. This file is analogous to the ReCapSeg seg file.
    • gatk4cnv_tn_file_capture -- tangent normalized (denoised) target copy ratio estimates of GATK CNV. This file is analogous to the ReCapSeg tn file.
    • gatk4cnv_pre_tn_file_capture -- coverage profile (i.e. target copy ratio estimates without denoising) of GATK CNV. This file is analogous to the ReCapSeg tn file.
    • gatk4cnv_betahats_capture -- Tangent normalization coefficients used in the projection. This is in the weeds.
    • gatk4cnv_called_seg_file_capture -- output called seg file of GATK CNV. This file is analogous to the ReCapSeg called seg file.
    • gatk4cnv_oncotated_called_seg_file_capture -- gene list file generated from the GATK CNV segments.
    • gatk4cnv_dqc_capture (coming later) -- measure of noise reduction in the tangent normalization process. Lower is better.
    • gatk4cnv_preqc_capture (coming later) -- measure of noise before tangent normalization.
    • gatk4cnv_postqc_capture (coming later) -- measure of noise after tangent normalization.
    • gatk4cnv_num_seg_capture (coming later) -- number of segments in the GATK CNV output.

Pair annotations:

    • gatk4cnv_case_het_file_capture -- het pulldown file for the tumor sample in the pair.
    • gatk4cnv_control_het_file_capture -- het pulldown file for the normal sample in the pair.
    • gatk4cnv_acnv_seg_file_capture -- ACNV seg file with confidence intervals for copy ratio and minor allelic fraction.
    • gatk4cnv_acnv_acs_seg_file_capture -- ACNV seg file in a format that looks as if it was produced by AllelicCapSeg. Any segments called as "balanced" will be pegged to a MAF of 0.5. This file is ready for ingestion by ABSOLUTE.
    • gatk4cnv_acnv_cnv_seg_file_capture -- ACNV seg file in a format that looks as if it was produced by GATK CNV.
    • gatk4cnv_acnv_titan_het_file_capture -- het file in a format that can be ingested by TITAN.
    • gatk4cnv_acnv_titan_cr_file_capture -- target copy ratio estimates file in a format that can be ingested by TITAN.
    • gatk4cnv_acnv_cnloh_balanced_file_capture -- ACNV seg file with calls for whether a segment is balanced or CNLoH (or neither).
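The *_wgs naming convention mentioned at the top of this answer can be sketched as a simple suffix swap. This is only an illustration: `wgs_counterpart` is a hypothetical helper, and since counterparts exist only "where applicable", a name returned by it is not guaranteed to exist in your workspace. The example pair of names is the one given in the ABSOLUTE question above.

```python
def wgs_counterpart(annotation):
    """Map a *_capture annotation name to its *_wgs counterpart by swapping the suffix."""
    suffix = "_capture"
    if not annotation.endswith(suffix):
        raise ValueError("expected an annotation name ending in '_capture': %r" % annotation)
    return annotation[: -len(suffix)] + "_wgs"


print(wgs_counterpart("gatk4cnv_acnv_acs_seg_file_capture"))
# gatk4cnv_acnv_acs_seg_file_wgs
```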

Do the workflows also run on the normals?

GATK CNV, yes.

GATK ACNV, no. There is a het pulldown generated for the normal, as a side effect, when doing the het pulldown for the tumor.

What about array data?

The GATK4 CNV tools do not run on array data. Sequencing data only.

Do we still need separate PoNs if we want to run on X and Y?

Yes.

Can I run both the ReCapSeg workflow and the GATK CNV toolchain workflow?

Yes. All results are written to separate annotations.

Are the new workflows part of my PrAn?

No, not yet. You will need to copy (and run) these manually from Algorithm_Commons before you begin analysis. As a reminder, copy into your analysis workspace.

Does GATK CNV require matched (tumor-normal) samples?

No.

Does GATK ACNV require matched (tumor-normal) samples?

In Firehose, yes. Out of Firehose, no.

How do I modify the ABSOLUTE tasks in FH to accept the new GATK ACNV annotations?

There are two changes you need to make to the ABSOLUTEv1.5WES configuration to make it accept the new outputs.

    • replace alleliccapseg_tsv with gatk4cnv_acnv_acs_seg_file_capture in the inputs
    • replace alleliccapseg_skew with 0.9883274, and change the annotation type to "Literal" instead of "Simple Expression"
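The two changes above can be pictured as an edit to the task's input bindings. The sketch below is purely illustrative: Firehose task configurations are edited through Firehose itself, and the dictionary layout, the input names `seg_file` and `skew`, and the function name are all hypothetical. The spelling `alleliccapseg_tsv` is also an assumption, written by analogy with `alleliccapseg_skew`. Only the new annotation name, the literal value, and the two annotation types come from the text.

```python
import copy

# Hypothetical representation of the ABSOLUTE task inputs before the change.
EXAMPLE_CONFIG = {
    "inputs": {
        "seg_file": {"type": "Simple Expression", "value": "alleliccapseg_tsv"},
        "skew": {"type": "Simple Expression", "value": "alleliccapseg_skew"},
    }
}


def switch_to_acnv(config):
    """Return a copy of the config with the two changes described above applied."""
    updated = copy.deepcopy(config)
    for binding in updated["inputs"].values():
        if binding["value"] == "alleliccapseg_tsv":
            # Change 1: point the seg file input at the ACNV output annotation.
            binding["value"] = "gatk4cnv_acnv_acs_seg_file_capture"
        elif binding["value"] == "alleliccapseg_skew":
            # Change 2: use a fixed skew value, typed as a literal.
            binding["value"] = "0.9883274"
            binding["type"] = "Literal"
    return updated


new_config = switch_to_acnv(EXAMPLE_CONFIG)
print(new_config["inputs"]["seg_file"]["value"])
print(new_config["inputs"]["skew"])
```

The original configuration is left untouched, which makes it easy to compare the before and after bindings side by side.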

From eminikel on 2018-11-07

If I am working on premises on the Broad cluster, where can I access GenomeAnalysisTK.jar and not need to have my own installation?