GATK CNV Toolchain in Firehose and FAQ Broad Internal

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019.

created by LeeTL1220

on 2016-06-15

We have put the GATK4 Somatic CNV Toolchain into Firehose. Please copy the workflows below from ``Algorithm_Commons``:

```
GATK_Somatic_CNV_Toolchain_Capture
GATK_Somatic_CNV_Toolchain_WGS
```

For questions and discussions, not just specific to Firehose, please see the GATK 4 forum: http://gatkforums.broadinstitute.org/gatk/categories/gatk-4-alpha

Who do I contact with an issue?

First, make sure that your question is not already answered here or in another forum post.

If it is a Firehose issue or you are not sure, email ``pipeline-help@broadinstitute.org``.

If you are sure that it is an issue with GATK CNV, ACNV, or GetBayesianHetPulldown, post to the forum.

What is GATK CNV vs. ACNV and which are run in the workflows above?

- GATK CNV estimates total copy ratio and performs segmentation and (basic) event calling. This tool works very similarly to ReCapSeg (for now).

- GATK ACNV creates credible intervals for copy ratio and minor allelic fraction (MAF). Under the hood, this tool is very different from Allelic CapSeg, but it can produce a file that can be ingested by ABSOLUTE (i.e., the file is in the same format as the one produced by Allelic CapSeg).

- Both GATK CNV and ACNV are in the workflows above.

Are the results (e.g. sensitivity and precision) better than ReCapSeg in the GATK CNV toolchain?

Without the allelic integration, the results are equivalent. If you want more details, ask in the forum or invite us to talk to you. We have a presentation or two about this topic.

Do I run these workflows on Pair Sets or Individual Sets?

Individual Sets

What entity types do the tasks run on?

Samples and Pairs. I realize that the previous answer says to run the workflows on Individual Sets. That is a workaround for a Firehose issue.

What are the caveats around WGS?

- The total copy number tasks (similar to ReCapSeg) take about a tenth of the time that ReCapSeg takes, assuming good NFS performance. This is a good thing.

- The allelic tasks (GetBayesianHetPulldown and Allelic CNV) take a very long time to run. Over a day of runtime is not uncommon. We expect to address this in the next version of the GATK4 CNV Toolchain, but due to dispatch limitations, Firehose may not be able to fully capitalize on these improvements.

- The runtimes in general are very sensitive to filesystem performance.

- The results still have the same oversegmentation issues that you will see in ReCapSeg. There is a GC correction tool, but this has not been integrated into the Firehose workflow.

- There is a bug in a third-party library that limits the size of a PoN. This is unlikely to be an issue for capture, but can become a problem for WGS. For more details, please see gatkforums.broadinstitute.org/gatk/discussion/7594/limits-on-the-size-of-a-pon

What about the future of ReCapSeg?

We are phasing out ReCapSeg, for many reasons, everywhere — not just Firehose. If you would like more details, post to the forum and we’ll respond.

What about the future of Allelic CapSeg?

We have never supported (and never will support) Allelic CapSeg and cannot answer that question. We have some results comparing Allelic CapSeg and GATK ACNV. We can show you if you are interested (internal to Broad only).

Why are there fewer plots than in ReCapSeg?

We did not include plots that we did not believe were being used. If you would like to include additional plots, please post to the forum.

How is the GATK 4 CNV toolchain workflow better than the ReCapSeg workflow?

1) Faster. On exome, ReCapSeg takes ~105 minutes per case sample. GATK CNV takes < 30 minutes. Both time estimates assume good NFS performance.

2) The workflows above include allelic integration results, from the tool GATK ACNV. These results are analogous to what Allelic CapSeg produces.

3) The workflows above produce results compatible with ABSOLUTE and TITAN, i.e., the results can be used as input to either tool.

4) All future improvements and bugfixes are going into GATK, not ReCapSeg, and many improvements are coming.

5) The workflows produce germline heterozygous SNP call files.

6) The ReCapSeg WGS workflow no longer works.
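As a rough illustration of point 1, the per-sample figures quoted above imply a speedup of at least 3.5x on exome. The sketch below only does that arithmetic. The cohort size of 100 is a made-up example, the "< 30 minutes" figure is treated as an upper bound, and the total is summed compute time, not wall-clock time in Firehose.

```python
# Per-sample exome runtimes quoted in this FAQ (minutes), assuming good NFS performance.
RECAPSEG_MIN = 105        # approximate figure for ReCapSeg
GATK_CNV_MAX_MIN = 30     # quoted as "< 30", so treated here as an upper bound


def min_speedup(old_min: float, new_max_min: float) -> float:
    """Lower bound on the speedup when the new runtime is at most new_max_min."""
    return old_min / new_max_min


def min_hours_saved(n_samples: int) -> float:
    """Lower bound on summed compute hours saved across n_samples case samples."""
    return n_samples * (RECAPSEG_MIN - GATK_CNV_MAX_MIN) / 60.0


print(min_speedup(RECAPSEG_MIN, GATK_CNV_MAX_MIN))  # 3.5
print(min_hours_saved(100))                         # 125.0 (hypothetical 100-sample cohort)
```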

Are there new PoNs for these workflows?

Yes, but the PoN locations are already populated, if you run the workflows properly. Users do not need to do anything.

Is the correct PoN automatically selected for ICE vs. Agilent samples?

Yes, if you run the workflow.

Is there a PoN creation workflow in Firehose?

No. Never going to happen. Don’t ask. See the forum for instructions (and a Queue workflow) to create PoNs.

Can I run ABSOLUTE from the output of GATK ACNV?

Yes. The annotations are ``gatk4cnv_acnv_acs_seg_file_capture`` (capture) and ``gatk4cnv_acnv_acs_seg_file_wgs`` (WGS).

Can I run TITAN from the output of GATK ACNV?

Yes, though there has been little testing. Per the annotation list below, the TITAN-ready files are ``gatk4cnv_acnv_titan_het_file_capture`` and ``gatk4cnv_acnv_titan_cr_file_capture`` (and their ``*_wgs`` counterparts, where applicable).

Do the workflows above include Oncotator gene lists?

Yes.

Is the GATK4 CNV Toolchain in alpha?

Technically, the whole GATK4 is in alpha, but that includes more than just the GATK CNV toolchain. We are confident that the version in the workflows above produces high quality results. Please tell us if you find otherwise!

These workflows have Picard Target Mapper. Isn’t that going to cause me to have to rerun all of my jobs (e.g. MuTect)?

The workflows above will rerun Picard Target Mapper, but only new annotations are added. All previous output annotations of Picard Target Mapper should be populated with the same values. It will look as if this outdated mutation calling (MuTect) and other tasks, but the reruns will be job avoided.

Can I do the tumor-only GATK ACNV workflow?

For exome, the tumor-only workflow is working well, but it is not available in Firehose. If you would like to see evaluation data for tumor-only on exome, we can show you (internal to Broad only). Please contact us if you need this in Firehose and we will work with you to set it up.

What are all of the annotations produced?

Where applicable, each annotation in the lists below also has a ``*_wgs`` counterpart.

Sample annotations:

- gatk4cnv_seg_file_capture — seg file of GATK CNV. This file is analogous to the ReCapSeg seg file.

- gatk4cnv_tn_file_capture — tangent normalized (denoised) target copy ratio estimates of GATK CNV. This file is analogous to the ReCapSeg tn file.

- gatk4cnv_pre_tn_file_capture — coverage profile (i.e. target copy ratio estimates without denoising) of GATK CNV. This file is analogous to the ReCapSeg tn file.

- gatk4cnv_betahats_capture — Tangent normalization coefficients used in the projection. This is in the weeds.

- gatk4cnv_called_seg_file_capture — output called seg file of GATK CNV. This file is analogous to the ReCapSeg called seg file.

- gatk4cnv_oncotated_called_seg_file_capture — gene list file generated from the GATK CNV segments

- gatk4cnv_dqc_capture (coming later) — measure of noise reduction in the tangent normalization process. Lower is better.

- gatk4cnv_preqc_capture (coming later) — measure of noise before tangent normalization

- gatk4cnv_postqc_capture (coming later) — measure of noise after tangent normalization

- gatk4cnv_num_seg_capture (coming later) — number of segments in the GATK CNV output

Pair annotations:

- gatk4cnv_case_het_file_capture — het pulldown file for the tumor sample in the pair.

- gatk4cnv_control_het_file_capture — het pulldown file for the normal sample in the pair.

- gatk4cnv_acnv_seg_file_capture — ACNV seg file with confidence intervals for copy ratio and minor allelic fraction.

- gatk4cnv_acnv_acs_seg_file_capture — ACNV seg file in a format that looks as if it was produced by AllelicCapSeg. Any segments called as “balanced” will be pegged to a MAF of 0.5. This file is ready for ingestion by ABSOLUTE.

- gatk4cnv_acnv_cnv_seg_file_capture — ACNV seg file in a format that looks as if it was produced by GATK CNV

- gatk4cnv_acnv_titan_het_file_capture — het file in a format that can be ingested by TITAN

- gatk4cnv_acnv_titan_cr_file_capture — target copy ratio estimates file in a format that can be ingested by TITAN

- gatk4cnv_acnv_cnloh_balanced_file_capture — ACNV seg file with calls for whether a segment is balanced or CNLoH (or neither).
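The naming convention in the lists above (``*_capture`` names with ``*_wgs`` counterparts) can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and because the counterparts exist only "where applicable", a derived name is not guaranteed to be a real annotation in your workspace.

```python
CAPTURE_SUFFIX = "_capture"
WGS_SUFFIX = "_wgs"


def wgs_counterpart(annotation: str) -> str:
    """Return the *_wgs name that corresponds to a *_capture annotation name.

    Hypothetical helper: it only rewrites the suffix and does not check
    that the resulting annotation actually exists in Firehose.
    """
    if not annotation.endswith(CAPTURE_SUFFIX):
        raise ValueError(f"not a capture annotation: {annotation!r}")
    return annotation[: -len(CAPTURE_SUFFIX)] + WGS_SUFFIX


print(wgs_counterpart("gatk4cnv_acnv_acs_seg_file_capture"))
# gatk4cnv_acnv_acs_seg_file_wgs
```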

Do the workflows also run on the normals?

GATK CNV, yes.

GATK ACNV, no.

There is a het pulldown generated for the normal, as a side effect, when doing the het pulldown for the tumor.

What about array data?

The GATK4 CNV tools do not run on array data. Sequencing data only.

Do we still need separate PoNs if we want to run on X and Y?

Yes.

Can I run both the ReCapSeg workflow and the GATK CNV toolchain workflow?

Yes. All results are written to separate annotations.

Are the new workflows part of my PrAn?

No, not yet. You will need to copy (and run) these manually from ``Algorithm_Commons`` before you begin analysis. As a reminder, copy into your analysis workspace.

Does GATK CNV require matched (tumor-normal) samples?

No.

Does GATK ACNV require matched (tumor-normal) samples?

In Firehose, yes. Outside of Firehose, no.

How do I modify the ABSOLUTE tasks in FH to accept the new GATK ACNV annotations?

There are two changes you need to make to the ABSOLUTE_v1.5_WES configuration to make it accept the new outputs.

1) replace alleliccapseg_tsv with gatk4cnv_acnv_acs_seg_file_capture in the inputs

2) replace alleliccapseg_skew with 0.9883274, and change the annotation type to “Literal” instead of “Simple Expression”
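The two changes can be sketched as a transformation on a list of task inputs. This is purely illustrative: Firehose configurations are edited in the Firehose interface, and the list-of-dicts layout, the ``type``/``value`` keys, and the function name below are assumptions made for the example, not a Firehose format or API. Only the annotation names, the skew value, and the two type labels come from the steps above.

```python
def use_gatk_acnv_outputs(inputs):
    """Return a copy of the (hypothetical) input list with the two changes applied."""
    updated = []
    for entry in inputs:
        entry = dict(entry)  # copy so the caller's configuration is left untouched
        if entry["value"] == "alleliccapseg_tsv":
            # Change 1: point the seg-file input at the GATK ACNV output.
            entry["value"] = "gatk4cnv_acnv_acs_seg_file_capture"
        elif entry["value"] == "alleliccapseg_skew":
            # Change 2: use a fixed skew and mark the input as a literal.
            entry["type"] = "Literal"
            entry["value"] = "0.9883274"
        updated.append(entry)
    return updated


# Hypothetical "before" state of the two relevant inputs.
absolute_inputs = [
    {"type": "Simple Expression", "value": "alleliccapseg_tsv"},
    {"type": "Simple Expression", "value": "alleliccapseg_skew"},
]
for entry in use_gatk_acnv_outputs(absolute_inputs):
    print(entry)
```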

This answer thanks to Dimitri Livitz, Daniel Rosebrock, and David Kwiatkowski.

Updated on 2016-06-23

From breardon on 2016-07-13

Re: Do we still need separate PoNs if we want to run on X and Y? Yes.

Since the GATK4CNV bed files / PoNs do not contain X & Y, are there any “GATK supported” bed files / PoNs that contain this information? If not, totally fine :)

p.s. loving GATK CNV

From LeeTL1220 on 2016-08-02

@breardon Coming soon…

From jjiao on 2017-01-18

Thanks for this super helpful documentation! I’ve run “GATK 4 ACNV Get Bayesian Hets Using Tumor and Normal for Capture” on FH a few times without problems, but they seem to be hanging on a more recent outside dataset. It’s been five days and there appears to be no output in the tmp directory. I’m wondering if this is a known problem—perhaps I’m missing some file? Thanks in advance!

From slee on 2017-01-19

@jjiao We’ve heard that this occasionally happens when running GetBayesianHetCoverage, but have yet to track down the exact cause. Is this happening for all of your samples in the dataset or just a few? Can you see any obviously pathological sites (perhaps with high read count) at which the tool is hanging? Any information you can provide (ideally, a BAM snippet containing some bad sites) would definitely help us understand and fix this bug!

In the meantime, perhaps you could use GetHetCoverage as a fallback. This tool uses a relatively naive frequentist test to genotype germline hets, but should give comparable results.
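For readers unfamiliar with the idea, a naive frequentist het test can be sketched as an exact two-sided binomial test of the alt-read count against an expected allele fraction of 0.5. This is not GetHetCoverage's actual implementation; the function names, the 0.05 threshold, and the minimum depth of 10 are assumptions chosen for illustration.

```python
from math import exp, lgamma, log


def binomial_two_sided_pvalue(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value for k successes in n trials (0 < p < 1).

    Sums the probabilities of all outcomes that are no more likely than k.
    Log-space terms keep this usable at high read depth.
    """
    def pmf(i: int) -> float:
        log_choose = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
        return exp(log_choose + i * log(p) + (n - i) * log(1 - p))

    observed = pmf(k)
    tolerance = 1 + 1e-7  # absorb floating-point noise when comparing probabilities
    total = sum(q for q in (pmf(i) for i in range(n + 1)) if q <= observed * tolerance)
    return min(1.0, total)


def looks_heterozygous(ref_count: int, alt_count: int,
                       alpha: float = 0.05, min_depth: int = 10) -> bool:
    """Call a site het if a 50/50 allele split cannot be rejected at level alpha."""
    depth = ref_count + alt_count
    if depth < min_depth:
        return False  # too few reads to say anything
    return binomial_two_sided_pvalue(alt_count, depth) >= alpha


print(looks_heterozygous(20, 20))  # True: perfectly balanced counts
print(looks_heterozygous(40, 0))   # False: no alt reads at depth 40
```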

From Sahar90 on 2017-01-23

Thank you for this clear and detailed documentation.

Is the no-normal het pulldown for “GATK 4 ACNV Get Bayesian Hets Using Tumor and Normal for Capture” in Firehose? If not, are there configuration changes I can make to not use normal samples?

From LeeTL1220 on 2017-01-23

@Sahar90 I think the short answer is “no”. But it depends on your Firehose experience…

You would have to create a new module, create a new task configuration, change the command line in the hydrant.deploy, and create a new workflow.

From Sahar90 on 2017-01-26

@LeeTL1220 Thanks. What exact changes need to be made to the command line in the hydrant.deploy?

From LeeTL1220 on 2017-01-30

@Sahar90 You need to make a new module. Since you seem to be a Broadie, you should go to Firehose office hours. There are many steps. Just make it clear that you need to create a copy of the module and then make changes, not change the existing one.

See the forum for the exact command line, but you can start with ``java -jar gatk-protected.jar GetBayesianHetCoverage --help``. Also, see the forum post http://gatkforums.broadinstitute.org/gatk/discussion/7719/overview-of-getbayesianhetcoverage-for-heterozygous-snp-calling