Somatic short variant discovery (SNVs + Indels)

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For the latest documentation and forum, see the current GATK website.

created by Geraldine_VdAuwera

on 2018-01-07

Purpose

Identify somatic short variants (SNVs and Indels) in one or more tumor samples from a single individual, with or without a matched normal sample.

Reference Implementations

| Pipeline | Summary | Notes | Github | Terra |
|:-------------|:---------------|:---------|:-----------|:--------------|
| Somatic short variants tumor-normal pair | T-N BAMs to VCF | universal | yes | b37 |
| Somatic short variants PON creation | Normal BAMs to PON | universal | yes | b37 |

Expected input

This workflow requires BAM files for each input tumor and normal sample. Input BAMs should be pre-processed as described in the GATK Best Practices for data pre-processing.

Main steps

There are two main steps to this workflow. First we generate a large set of candidate somatic variants, then we filter them to obtain a more confident set of somatic variant calls.

Call candidate variants

Tools involved: Mutect2

Like HaplotypeCaller, Mutect2 calls SNVs and indels simultaneously via local de-novo assembly of haplotypes in an active region. That is, when Mutect2 encounters a region showing signs of somatic variation, it discards the existing mapping information and completely reassembles the reads in that region in order to generate candidate variant haplotypes. Like HaplotypeCaller, Mutect2 then aligns each read to each haplotype via the Pair-HMM algorithm to obtain a matrix of likelihoods. Finally, it applies a Bayesian somatic likelihoods model to obtain the log odds for alleles to be somatic variants versus sequencing errors.
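As a rough illustration of this step, the sketch below assembles a Mutect2 command line as a Python argument list for a tumor sample with an optional matched normal. The function name `mutect2_command` and all file and sample names are hypothetical. The flags follow GATK 4.1-style usage and should be checked against the tool documentation for the version you run.

```python
from typing import List, Optional


def mutect2_command(
    reference: str,
    tumor_bam: str,
    output_vcf: str,
    normal_bam: Optional[str] = None,
    normal_sample: Optional[str] = None,
    f1r2_tar_gz: Optional[str] = None,
) -> List[str]:
    """Build (but do not run) a Mutect2 command line as an argv list."""
    cmd = ["gatk", "Mutect2", "-R", reference, "-I", tumor_bam]
    if normal_bam is not None:
        # -normal expects the sample name stored in the normal BAM's
        # read group (SM tag), not the path to the file.
        if normal_sample is None:
            raise ValueError("normal_sample is required when normal_bam is given")
        cmd += ["-I", normal_bam, "-normal", normal_sample]
    if f1r2_tar_gz is not None:
        # Optional F1R2 counts output, used later to learn orientation bias.
        cmd += ["--f1r2-tar-gz", f1r2_tar_gz]
    cmd += ["-O", output_vcf]
    return cmd


if __name__ == "__main__":
    # Hypothetical file and sample names, for illustration only.
    print(" ".join(mutect2_command(
        "ref.fasta", "tumor.bam", "unfiltered.vcf",
        normal_bam="normal.bam", normal_sample="NORMAL_SAMPLE",
        f1r2_tar_gz="f1r2.tar.gz",
    )))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where `gatk` is on the `PATH`.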

Calculate Contamination

Tools involved: GetPileupSummaries, CalculateContamination

This step emits an estimate of the fraction of reads due to cross-sample contamination for each tumor sample and an estimate of the allelic copy number segmentation of each tumor sample. Unlike other contamination tools, CalculateContamination is designed to work well without a matched normal even in samples with significant copy number variation and makes no assumptions about the number of contaminating samples.
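A minimal sketch of how the two tools might be chained is shown below, again as Python argument lists rather than executed commands. The helper names and all file names are hypothetical, and using the same common-variants VCF for both `-V` and `-L` is an assumption based on typical usage. Verify the flags against the documentation for your GATK version.

```python
from typing import List, Optional


def pileup_summaries_command(bam: str, common_variants_vcf: str, output_table: str) -> List[str]:
    """Build a GetPileupSummaries command for one BAM."""
    return [
        "gatk", "GetPileupSummaries",
        "-I", bam,
        "-V", common_variants_vcf,  # sites with known population allele frequencies
        "-L", common_variants_vcf,  # restrict the traversal to those same sites
        "-O", output_table,
    ]


def contamination_commands(
    tumor_bam: str,
    common_variants_vcf: str,
    normal_bam: Optional[str] = None,
) -> List[List[str]]:
    """Return the commands, in order, that estimate contamination for one tumor."""
    commands = [pileup_summaries_command(tumor_bam, common_variants_vcf, "tumor-pileups.table")]
    calc = ["gatk", "CalculateContamination", "-I", "tumor-pileups.table"]
    if normal_bam is not None:
        # The matched normal is optional; the tool is designed to work without it.
        commands.append(pileup_summaries_command(normal_bam, common_variants_vcf, "normal-pileups.table"))
        calc += ["-matched", "normal-pileups.table"]
    calc += ["--tumor-segmentation", "segments.table", "-O", "contamination.table"]
    commands.append(calc)
    return commands


if __name__ == "__main__":
    for command in contamination_commands("tumor.bam", "common_biallelic.vcf.gz"):
        print(" ".join(command))
```

The two outputs correspond to the two estimates described above: `contamination.table` holds the contamination fraction and `segments.table` holds the allelic copy number segmentation.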

Learn Orientation Bias Artifacts

Tools involved: LearnReadOrientationModel

This tool uses an optional F1R2 counts output of Mutect2 to learn the parameters of a model for orientation bias. It finds prior probabilities of single-stranded substitution errors prior to sequencing for each trinucleotide context. This is extremely important for FFPE tumor samples.
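The sketch below shows one way the command for this step could be assembled, assuming Mutect2 was run with its F1R2 counts output enabled. The function name and file names are hypothetical. Passing several `-I` arguments is meant for the case where Mutect2 was scattered over intervals and produced one F1R2 archive per shard.

```python
from typing import List, Sequence


def learn_orientation_model_command(
    f1r2_archives: Sequence[str],
    output_priors: str,
) -> List[str]:
    """Build a LearnReadOrientationModel command from one or more F1R2 archives."""
    if not f1r2_archives:
        raise ValueError("at least one F1R2 archive is required")
    cmd = ["gatk", "LearnReadOrientationModel"]
    for archive in f1r2_archives:
        # One -I per F1R2 counts archive produced by Mutect2.
        cmd += ["-I", archive]
    cmd += ["-O", output_priors]
    return cmd


if __name__ == "__main__":
    # Hypothetical shard names, for illustration only.
    print(" ".join(learn_orientation_model_command(
        ["shard0-f1r2.tar.gz", "shard1-f1r2.tar.gz"],
        "read-orientation-model.tar.gz",
    )))
```

The output archive holds the learned priors, which the filtering step can then take as an input.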

Filter Variants

Tools involved: FilterMutectCalls

Mutect2’s somatic likelihoods model assumes that read errors are independent, so that, for example, four reads each with an error probability of 1/1000 yield odds of roughly 1000^4 (a log10 odds of about 12) in favor of being a real variant versus a sequencing error. FilterMutectCalls accounts for correlated errors, that is, the possibility that all variant reads at a site were due to some common source of error. It accomplishes this through several hard filters to detect alignment artifacts and through probabilistic models for strand and orientation bias artifacts, polymerase slippage artifacts, germline variants, and contamination. Additionally, it learns a Bayesian model for the overall SNV and indel mutation rate and allele fraction spectrum of the tumor to refine the log odds emitted by Mutect2. It then automatically sets a filtering threshold to optimize the F score, the harmonic mean of sensitivity and precision.
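The arithmetic in this paragraph can be checked with a few lines of Python. This is a simplified illustration, not the model Mutect2 or FilterMutectCalls actually implements: it assumes each supporting read independently contributes a factor of 1/error_prob to the odds, and it computes the plain harmonic-mean F score. Both function names are hypothetical.

```python
import math


def independent_error_log10_odds(n_reads: int, error_prob: float) -> float:
    """log10 odds of 'real variant' vs 'all reads are errors', assuming
    independent errors, so the odds are roughly (1 / error_prob) ** n_reads."""
    if not 0.0 < error_prob < 1.0:
        raise ValueError("error_prob must be strictly between 0 and 1")
    return -n_reads * math.log10(error_prob)


def f_score(sensitivity: float, precision: float) -> float:
    """Harmonic mean of sensitivity and precision (0.0 if both are zero)."""
    if sensitivity + precision == 0.0:
        return 0.0
    return 2.0 * sensitivity * precision / (sensitivity + precision)


if __name__ == "__main__":
    # Four reads, each with a 1/1000 error probability: odds of about 1000^4.
    print(independent_error_log10_odds(4, 1e-3))
    # Made-up sensitivity and precision values, for illustration only.
    print(f_score(0.9, 0.6))
```

The first function shows why the independence assumption is optimistic: the log odds grow linearly with the number of supporting reads, even when every one of those reads shares a single underlying artifact.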

Annotate Variants

Tools involved: Funcotator

At this step we run tools to add information to the discovered variants in our dataset. One of those tools, Funcotator, can be used to add gene-level information to each variant. Funcotator is a functional annotation tool in the core GATK toolset and was designed to handle both somatic and germline use cases. Funcotator reads in a VCF file, labels each variant with one of twenty-three distinct variant classifications, produces gene information (e.g. affected gene, predicted variant amino acid sequence, etc.), and associations to information in datasources. Supported datasources include GENCODE (gene information and protein change prediction), dbSNP, gnomAD, and COSMIC (among others). The corpus of datasources is extensible and user-configurable and includes cloud-based datasources supported with Google Cloud Storage. Funcotator produces either a Variant Call Format (VCF) file (with annotations in the INFO field) or a Mutation Annotation Format (MAF) file.
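As a sketch of how this step might be invoked, the function below builds a Funcotator command line that writes either VCF or MAF output. The function name, file names, and data sources directory are hypothetical, and `hg19` is an assumed reference version chosen to match the b37 pipelines listed above. Check the available options against the Funcotator documentation for your GATK version.

```python
from typing import List


def funcotator_command(
    variants_vcf: str,
    reference: str,
    data_sources_path: str,
    output_path: str,
    output_format: str = "VCF",
    ref_version: str = "hg19",
) -> List[str]:
    """Build (but do not run) a Funcotator command line as an argv list."""
    if output_format not in ("VCF", "MAF"):
        # Per the text above, Funcotator produces either a VCF or a MAF file.
        raise ValueError("output_format must be 'VCF' or 'MAF'")
    return [
        "gatk", "Funcotator",
        "--variant", variants_vcf,
        "--reference", reference,
        "--ref-version", ref_version,
        "--data-sources-path", data_sources_path,
        "--output", output_path,
        "--output-file-format", output_format,
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(" ".join(funcotator_command(
        "filtered.vcf", "ref.fasta", "funcotator_dataSources",
        "filtered.funcotated.maf", output_format="MAF",
    )))
```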

Additional Information

Updated on 2019-06-03

From alongalor on 2018-02-08

> @Geraldine_VdAuwera said:

> A brand new version of these workflows is about to be released and will be made available within the next few days, along with the relevant documentation.

Just wanted to check in to see if this comment is still relevant, or if the new documentation has already been uploaded. Thanks so much!

From ehscholl on 2018-02-09

Also checking in to see if there are any updates.

From Rebecca_Donnelly on 2018-02-09

How do I reference this picture if I use it?

From shlee on 2018-02-09

@alongalor @ehscholl:

For GATK4 Mutect2 related links, see https://software.broadinstitute.org/gatk/blog?id=11337.

The exploratory tutorial is at https://software.broadinstitute.org/gatk/documentation/article?id=11136.

From Geraldine_VdAuwera on 2018-02-14

@Rebecca_Donnelly You can credit the figure to the Broad Institute Data Sciences Platform and link to [this page](https://software.broadinstitute.org/gatk/best-practices/workflow?id=11146).

From Geraldine_VdAuwera on 2018-02-14

@alongalor and @ehscholl: these doc pages are trailing a bit behind the state of the workflows themselves, sorry. We plan to have more comprehensive overview-level docs here than are currently available (see the germline short variants page for a preview of what we’re aiming for), but for now your best bet is to check out the more detailed docs that @shlee referenced above.

From hashish on 2018-04-09

I read in the forums somewhere that the workflows are coming out in April, any updates?

From Sheila on 2018-04-09

@hashish

Hi,

Soo Hee published a [blog](https://software.broadinstitute.org/gatk/blog?id=11337) with links to all Mutect2 related articles.

-Sheila

From hashish on 2018-04-10

Thank you, Sheila. I was mainly asking whether there is a detailed Mutect2 Best Practices document similar to the germline one (as mentioned by @Geraldine_VdAuwera).

From Geraldine_VdAuwera on 2018-04-10

@hashish Not yet, we’re working on it.

From woodwordf_aa on 2018-08-24

Why is the task CollectSequencingArtifactMetrics deprecated?

I noticed that in mutect2.wdl the task CollectSequencingArtifactMetrics and the option run_orientation_bias_filter are deprecated.

Is it because that step is no longer necessary, or because you have a better tool to replace gatk CollectSequencingArtifactMetrics?

From Sheila on 2018-09-05

@woodwordf_aa

Hi,

There is a better tool to replace that step which should be out very soon.

-Sheila

From dario_romagnoli on 2018-11-14

I remember reading that, after creating a Panel of Normals, it was possible to add more samples to the panel without including all the previously used normal samples. Is this feature still available?

From dario_romagnoli on 2018-11-15

I have a question regarding the WDL workflow. How can I limit the number of cores and the amount of memory used? I’m running it locally on a server with 40 cores and 500 GB of RAM. The process of creating the Panel of Normals (with 2 samples) quickly goes up to 400 GB and keeps growing.

From KKND on 2019-05-29

Why do I need a panel of normals for tumor-normal paired samples?

From Angry_Panda on 2019-07-31

Dear GATK team,

I have a question about the running time of mutect2.wdl.

I ran mutect2.wdl with mutect2.exome.inputs.json (provided on your GitHub page) on my own VM (24 cores, 120 GB RAM), after changing the input files to local paths.

It finished successfully in 11 minutes and generated `HCC1143-filtered.vcf` (579k).

Is this a reasonable result, or is it really weird? On Terra, I saw it take around 2 hours.

Is there any benchmark data for the somatic SNVs + indels workflow?

From Angry_Panda on 2019-08-01

Is there a detailed description of the differences between the nio and non-nio versions? I tried mutect2_nio.wdl with mutect2.exome.inputs.json (provided on the GitHub page) on my own VM (24 cores, 120 GB RAM), after changing the input files to local paths.

It failed, however. The stderr shows:

[August 1, 2019 1:10:13 PM UTC] org.broadinstitute.hellbender.tools.GetSampleName done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=1961361408

A USER ERROR has occurred: The specified fasta file (file:///home/cloud-user/gatk4-somatic-snvs-indels/inputs/Homo_sapiens_assembly19.fasta) does not exist.

Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
Using GATK jar /root/gatk.jar defined in environment variable GATK_LOCAL_JAR
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx3000m -jar /root/gatk.jar GetSampleName -R /home/cloud-user/gatk4-somatic-snvs-indels/inputs/Homo_sapiens_assembly19.fasta -I /home/cloud-user/gatk4-somatic-snvs-indels/inputs/HCC1143.bam -O tumor_name.txt -encode

The fasta file does actually exist. Do I need to change `-Dsamjdk.use_async_io_read_samtools=false` to `true`?
