How to Call somatic mutations using GATK4 Mutect2

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by bhanuGandham

on 2019-05-24

This tutorial is applicable to Mutect2 version 4.1.1.0 and higher. Post suggestions/questions in the Ask the GATK team section.

GATK 4.1.1.0 contains several major advances in the Mutect2 pipeline, which is good, but we have had to change command lines in a few places, which we usually try to avoid. Here are the major changes.

Bug Fixes

We fixed several bugs with error messages about invalid log probabilities, infinities, NaNs etc that were introduced by neglecting to account for finite precision errors. We also resolved an issue where CalculateContamination worked poorly on very small gene panels.

New Required Inputs to FilterMutectCalls

1) FilterMutectCalls now requires a reference, which should be the same fasta file input to Mutect2. 2) FilterMutectCalls also requires a stats file from Mutect2. When you run Mutect2 with output -O unfiltered.vcf, for example, it produces a file called unfiltered.vcf.stats. FilterMutectCalls automatically looks for this file, so if you are running locally no change is needed. That is,

gatk Mutect2 -R ref.fasta -I tumor.bam -O unfiltered.vcf

followed by

gatk FilterMutectCalls -R ref.fasta -V unfiltered.vcf -O filtered.vcf

works because Mutect2 creates unfiltered.vcf.stats behind the scenes and FilterMutectCalls knows to look for it. However, if you are running on a cluster or the cloud you need to keep track of the stats file. For example, you need to delocalize it from a VM, as is done in the Mutect2 WDL. You can explicitly input the stats file with the -stats argument in FilterMutectCalls. If you are scattering Mutect2 over multiple nodes you must merge the stats files with MergeMutectStats:

gatk MergeMutectStats -stats unfiltered1.vcf.stats -stats unfiltered2.vcf.stats -O merged.stats

and pass merged.stats to FilterMutectCalls.

New Filtering Strategy in FilterMutectCalls

FilterMutectCalls now filters based on a single quantity, the probability that a variant is a somatic mutation. Previously, different reasons why this might not be the case each had their own thresholds. We have removed parameters such as -normal-artifact-lod, -max-germline-posterior, -max-strand-artifact-probability, and -max-contamination-probability. Even the previously fundamental -tumor-lod is gone. Rather than replace these with a single threshold for the error probability, FilterMutectCalls replaces them with nothing at all. It automatically determines the threshold that optimizes the "F score", the harmonic mean of sensitivity and precision. In order to tweak results in favor of more sensitivity users may set -f-score-beta to a value greater than its default of 1 (beta is the relative weight of sensitivity versus precision in the harmonic mean). Setting it lower biases results toward greater precision.

You can think of these changes as doing two things. Unifying filtering thresholds puts all the filters at the same place on a ROC curve of sensitivity vs precision. Previously, one threshold might be sacrificing a lot of sensitivity for a small gain in precision while another might be doing the opposite, the result being poor sensitivity and precision that fell below the potential ROC curve. Once we’re on that ROC curve, we achieve a good balance between sensitivity and precision by optimizing the F score.

New Somatic Clustering Model in FilterMutectCalls

We had long suspected that modeling the spectrum of subclonal allele fractions would help distinguish somatic variants from errors. For example, if every somatic variant in a tumor were a het occurring in 40% of cells, we would know to reject anything with an allele fraction significantly different from 20%. In the Bayesian framework of Mutect2 this means that the likelihood for somatic variants is given by a binomial distribution. We account for an unknown number of subclones with a Dirichlet process binomial mixture model. This is still an oversimplification because CNVs, small subclones, and genetic drift of passenger mutations all contribute allele fractions that don’t match a few discrete values. Therefore, we include a couple of beta-binomials in the mixture to account for a background spread of allele fractions while still benefiting from clustering. Finally, we use these binomial and beta-binomial likelihoods to refine the tumor log odds calculated by Mutect2, which assume a uniform distribution of allele fractions.

In addition to clustering allele fractions we also learn the overall prior probabilities of somatic SNVs and indels so that we can be more skeptical of apparent variants in a quiet tumor, for example. We learn the parameters of this model with a stochastic EM approach, where the E steps consist of Chinese Restaurant Process sampling of the allele fraction clusters. In case you were wondering, we have tested this new approach on some of the off-label, non-cancer uses of Mutect2, such as mitochondria, and it works very well. FilterMutectCalls reports the learned parameters of somatic clustering in a new .filtering_stats.tsv output. This file also contains information on the probability threshold chosen to optimize the F score and the number of false positives and false negatives from each filter to be expected from this choice.

A step-by-step guide to the new Mutect2 Read Orientation Artifacts Workflow

If you suspect any of your samples of substitution errors that occur on a single strand before sequencing you should definitely use Mutect2's orientation bias filter. This applies to all FFPE tumor samples and samples sequenced on Illumina Novaseq machines, among others. In fact, with the optimizations in 4.1.1.0 you can run the filter even when you're not suspicious. It won't hurt accuracy and the CPU cost is now quite small.

There are three steps to the filter. First, run Mutect2 with the --f1r2-tar-gz argument. This creates an output with raw data used to learn the orientation bias model. Previously this was done by CollectF1R2Counts. By absorbing it into Mutect2, we eliminated the cost of CollectF1R2Counts with almost no effect on the runtime of Mutect2. When multiple tumor samples are specified, you only need a single --f1r2-tar-gz output, which contains data for each tumor sample.

gatk Mutect2 -R ref.fasta \ -L intervals.interval_list \ -I tumor.bam \ -germline-resource af-only-gnomad.vcf \ -pon panel_of_normals.vcf \ --f1r2-tar-gz f1r2.tar.gz \ -O unfiltered.vcf

Next, pass this raw data to LearnReadOrientationModel:

gatk LearnReadOrientationModel -I f1r2.tar.gz -O read-orientation-model.tar.gz

Run GetPileupSummaries to summarize read support for a set number of known variant sites.

gatk GetPileupSummaries \ -I tumor.bam \ -V chr17_small_exac_common_3_grch38.vcf.gz \ -L chr17_small_exac_common_3_grch38.vcf.gz \ -O getpileupsummaries.table

Estimate contamination with CalculateContamination.

gatk CalculateContamination \ -I getpileupsummaries.table \ -tumor-segmentation segments.table \ -O calculatecontamination.table

Finally, pass the learned read orientation model to FilterMutectCallswith the -ob-priors argument:

gatk FilterMutectCalls -V unfiltered.vcf \ [--tumor-segmentation segments.table] \ [--contamination-table contamination.table] \ --ob-priors read-orientation-model.tar.gz \ -O filtered.vcf

Advanced note: if you are scattering Mutect2 over nodes in a cluster or on the cloud, you must input the --f1r2-tar-gz output from each Mutect2 scatter to LearnReadOrientationModel. This is done automatically in the Mutect2 wdl in the gatk repo and on Terra. For example:

for chromosome in {1..22}; do gatk Mutect2 -R ref.fasta -I tumor.bam -L $chromosome --f1r2-tar-gz ${chromosome}-f1r2.tar.gz -O ${chromosome}-unfiltered.vcf`` done all_f1r2_input=`for chromosome in {1..22}; do printf -- "-I ${chromosome}-f1r2.tar.gz "; done` LearnReadOrientationModel $all_f1_r2_input -O read-orientation-model.tar.gz

A step-by-step guide to the new Mutect2 Panel of Normals Workflow

We rewrote CreateSomaticPanelOfNormals to use GenomicsDB for scalability. We also added the INFO fields FRACTION (the fraction of normal samples with an artifact) and BETA (the shape parameters of the beta distribution of artifact allele fractions among samples with the artifact) to the panel of normals. FilterMutectCalls doesn’t use these yet, but we hope to experiment with them in the near future. Furthermore, we never liked how germline variants, which are handled in a more principled way with our germline filter, ended up as hard-filtered pon sites, so the panel of normals workflow now optionally excludes germline events from its output, keeping only technical artifacts.

The three steps to create a panel of normals are:

Step 1: Run Mutect2 in tumor-only mode for each normal sample:

gatk Mutect2 -R reference.fasta -I normal1.bam --max-mnp-distance 0 -O normal1.vcf.gz gatk Mutect2 -R reference.fasta -I normal2.bam --max-mnp-distance 0 -O normal2.vcf.gz Etc

Step 2: Create a GenomicsDB from the normal Mutect2 calls:

gatk GenomicsDBImport -R reference.fasta -L intervals.interval_list \ --genomicsdb-workspace-path pon_db \ -V normal1.vcf.gz \ -V normal2.vcf.gz \ -V normal3.vcf.gz

Step 3: Combine the normal calls using CreateSomaticPanelOfNormals:

gatk CreateSomaticPanelOfNormals -R reference.fasta \ --germline-resource af-only-gnomad.vcf.gz \ -V gendb://pon_db \ -O pon.vcf.gz

In the third step, the AF-only gnomAD resource is the same public GATK resource used by Mutect2 (when calling tumor samples, but not in Step 1, above).

Updated on 2019-07-18

From yyj on 2019-05-27

Hi，@bhanuGandham

gatk Mutect2 -R ref.fasta \ -L intervals.interval_list \ -I tumor.bam \ -germline-resource af-only-gnomad.vcf \ -pon panel_of_normals.vcf \ —f1r2-tar-gz f1r2.tar.gz \ -O unfiltered.vcf

Is this new command line for tumor-only mode？

If this is，How about Tumor with matched normal？

From polymerase on 2019-05-28

Hi，@bhanuGandham

When I try step 2 to create PON, there is a mistake, “A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs. MNP found at chr1:889158 in VCF”.

How to fix this?

From bhanuGandham on 2019-05-29

@polymerase

We have updated the tutorial with GenomicsDBImport command to exclude MNPs using ` —max-mnp-distance 0`. Updated command as below:

`gatk GenomicsDBImport -R reference.fasta -L intervals.interval_list —genomicsdb-workspace-path pon_db —max-mnp-distance 0 -V normal1.vcf.gz -V normal2.vcf.gz -V normal3.vcf.gz`

This will fix the error.

From bhanuGandham on 2019-05-29

@yyj

Command-line examples for tumor and matched normal can be found in the tool documentation here: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_mutect_Mutect2.php

From alanhoyle on 2019-06-04

How does one source/generate the af-only-gnomad.vcf for the PON creation and mutect2 runs?

From shenglai on 2019-06-05

hi @bhanuGandham

it seems GenomicsDBImport in 4.1.2.0 does not support `max-mnp-distance`.

```

A USER ERROR has occurred: max-mnp-distance is not a recognized option

```

From the wdl here, https://github.com/broadinstitute/gatk/blob/master/scripts/mutect2_wdl/mutect2_pon.wdl#L9

it seems `max-mnp-distance` should be added to the Mutect2 tumor only calling. I just wanna confirm if I understand it correctly with you

Thanks

From ismael_P on 2019-06-07

Hello,

The documentation doesn’t say it but MergeMutectStats can take a .list file as input if you have a long list of .stats file.

I think you should describe this optionin the help as it is helpfull.

From anamaria on 2019-06-11

Even though it is specified in the post header that “This tutorial is applicable to Mutect2 version 4.1.1.0 and higher”, it appears that `MergeMutectStats` does not exist in version 4.1.0.0. Please correct me if I’m wrong.

From anamaria on 2019-06-12

I just realised this makes no sense and I got the versions mixed up, apologies.

From bhanuGandham on 2019-06-12

@shenglai

That is correct.

From dcampo on 2019-06-25

Hello!

I am trying to run GenomicsDB, but after 12 h, it exits with the following error:

terminate called after throwing an instance of ‘GenomicsDBConfigException‘ what(): GenomicsDBConfigException : Syntax error in JSON file /tmp/loader_2075782628441988593.json

This is my command line:

java -jar $GATK GenomicsDBImport \ -R $REFERENCE/GRCh38.p7.genome.fa \ -L $REFERENCE/MedExome_hg38_capture_targets.bed \ —genomicsdb-workspace-path $VCF/PON/PON_db \ —sample-name-map $VCF/PON/normals_for_pon_map \ —reader-threads 16 \ —verbosity ERROR

These are 5 exome samples, and I’m running on one node with 16 CPUs and 64 GB of RAM in our local cluster.

Also, the tool creates 17,400 directories, all starting with chr1. Does that mean that it is still working on the variants in chr1?? Because if so, even if I make it run, it will take days…am I doing something wrong?

Thanks!

From dcampo on 2019-06-27

Hi,

an update on my comment about “Syntax error in JSON file /tmp/loader_2075782628441988593.json”. I specified a different temp folder, and I am not getting that error anymore. I guess the default temp folder was filling up quickly.

Anyways, I have now the problem with run time. After 24 h, the tool creates thousands of directories, but it’s still working on chr 2, so it would take weeks to finish. And our university cluster has a wall time limit of 24 h…

My command line is the same than I posted above, except with the added “—tmp-dir=$SCRATCHDIR” option.

I don’t think this is supposed to take that long, so I assume I am doing something wrong, but can’t figure out what…any help, please?

Thanks!

From Juls on 2019-07-01

Hi,

@ New Filtering Strategy in FilterMutectCalls

Is there a description available for the (remaining) filters?

These are the new / edited ones, correct?

weak_evidence

slippage

haplotype

germline

Thanks!

From jcx9dy on 2019-07-03

@dcampo I have the same issue... would love to know if you figure it out.

After bouncing through the forums for a while, these are the links I've scrounged up:

https://gatkforums.broadinstitute.org/gatk/discussion/comment/53868/#Comment_53868
https://gatkforums.broadinstitute.org/gatk/discussion/comment/47871/#Comment_47871
https://gatkforums.broadinstitute.org/gatk/discussion/comment/46348/#Comment_46348
https://gatkforums.broadinstitute.org/gatk/discussion/comment/50838/#Comment_50838
firecloud example (scatter interval db and later merge out.vcf):
give up on GenomicsDBImport and switch to CombineGVCFs:

Some unorganized thoughts:

filesystem i/o read/write could really be slowing things down (these issues become more apparent on a cluster, as opposed to cloud instances that abstract away the shared resource limitations)
one recommendation was to scatter/parallelize the db import over many intervals (but still all samples, though perhaps with the --batch 50 option); subsequently merge the interval-specific pooled vcf somehow (how to properly do this? .. in the firecloud example?)
another last resort is to switch back to the old system of using CombineGVCFs which is similar to what Queue used to use?

I vaguely remember using Queue (scala workflow instead of wdl) to first scatter intervals and samples for mutect2 tumor-only mode for normals.bam, and the final step was calling CombineVariants to yield the PON.vcf This last merge step was very fast (~10 hours total for ~40 WGS). Something must've drastically changed about the more recent merging workflows if people report needing hundreds of hours.

From Rishabh on 2019-07-03

@bhanuGandham

I am trying to Create a GenomicsDB from the normal Mutect2 calls

So there are four command that i used and it pop out with different errors but now m i struck .

1. Creating GBImport

java -jar /PathToGatk GenomicsDBImport -R Hg19_Reference.fa -L V6.bed -V normal1.vcf.gz -V normal2.vcf.gz -V normal3.vcf.gz -V normal4.vcf.gz -V normal5.vcf.gz —genomicsdb-workspace-path pon_db A USER ERROR has occurred: Bad input: GenomicsDBImport does not support GVCFs with MNPs.

2. Removing MNP from vcf.gz files

java -jar /PathToGatk SelectVariants -R Hg19_Reference.fa —variant normal1.vcf.gz -O normal1_without_mnp.vcf.gz —xl-select-type MNP

3. Creating GBImport without MNP in vcf.gz files

java -jar /PathToGatk GenomicsDBImport -R Hg19_Reference.fa -L V6.bed -V normal1_without_mnp.vcf.gz -V normal2_without_mnp.vcf.gz -V normal3_without_mnp.vcf.gz -V normal4_without_mnp.vcf.gz -V normal5_without_mnp.vcf.gz —genomicsdb-workspace-path pon_db Duplicate field name AF found in vid attribute “fields“ terminate called after throwing an instance of ‘FileBasedVidMapperException‘ what(): FileBasedVidMapperException : Duplicate fields exist in vid attribute “fields“ Aborted

4. Using —max-mnp-distance 0 option to create GBImport

java -jar /PathToGatk GenomicsDBImport -R Hg19_Reference.fa -L V6.bed -V normal1_without_mnp.vcf.gz -V normal2_without_mnp.vcf.gz -V normal3_without_mnp.vcf.gz -V normal4_without_mnp.vcf.gz -V normal5_without_mnp.vcf.gz —genomicsdb-workspace-path pon_db —max-mnp-distance 0 A USERERROR has occurred: max-mnp-distance is not a recognized option

From jcx9dy on 2019-07-03

@Rishabh

4.) this page has not been corrected yet (one of the admins would need to move the `—max-mnp-distance 0` to step1 instead of step2 in the main post); but from a [comment](https://gatkforums.broadinstitute.org/gatk/discussion/comment/59051/#Comment_59051) in this section, the `—max-mnp-distance 0` is an option for Mutect2, not for GenomicsDBImport. So if you want to use this, you should have used it when creating the normal*.vcf.gz

I’m not sure what (3.) is … and am also not sure that (2.) will be sufficient. You could keep researching (3.) or, the easier/safer option (which is part of the established recommendations) is to just run Mutect2 with the aforementioned parameter to remove MNP.

From alanhoyle on 2019-07-09

Re: PON generation:

For any genome-wide interval sets, I think you’re also going to want to run GenomicsDBImport with the —merge-input-intervals option.

Were running this on TCGA normals that we aligned to the GDC’s GRCh38.d1.vd1 reference with BWA, then sorted/markdup before running through the Mutect2 artifact generation, then using Gencode exons for our —intervals.

We’re still doing testing, but a colleague found the previous version (i.e. without the GenomicsDB) to run far faster when making the PON.

From dcampo on 2019-07-09

@jcx9dy

I did not have time to troubleshoot, so I ended up installing a previous version of GATK to avoid GenomicsDBImport. At some point I will have to come back to this and play around a bit with parallelizing, etc.

That said, I don’t believe it is a matter of run time. Like I said above. these are only 5 exome samples; if after 24 h running, the tool is still working on chr2, it means that the whole thing will take weeks to complete. It has to be something else…

So, any help on this from the team will be greatly appreciated!

Thanks!

From alanhoyle on 2019-07-10

Just as a follow-up: combining 50 TCGA normals (600k-4m variants per artifact VCF; 75.4 million data rows total) resulted in a 108 MB VCF with 7.1 million data rows (i.e. not headers). GenomicsDBImport of the 50 VCFs took 164 minutes (2.7 hr) and created a 2.8 GB pon_db, CreateSomaticPanelOfNormals took 155 minutes (2.6 hr). That’s over 6 hours on some fast machines.

If one neglects the —merge-input-intervals option, I think it runs vastly slower and outputs many more error messages in part 2.

Are these the kinds of numbers we should expect?

From alanhoyle on 2019-07-10

@dcampo try `gatk GenomicsDBImport —merge-input-intervals …` that brought the time down to something that ran in 5-7 hours for us, but that is still far longer than I would have expected.

From tony on 2019-07-15

Hi @bhanuGandham

Regarding pon creation only,

1. we must not provide `—germline-resource` in Mutect2 call of Step1 ?

2. `—max-mnp-distance 0` has to be set in Mutect2 call of Step1 ?

Many thanks,

Anthony

From alanhoyle on 2019-07-16

alanhoyle said: >

dcampo try `gatk GenomicsDBImport —merge-input-intervals …` that brought the time down to something that ran in 5-7 hours for us, but that is still far longer than I would have expected.

Running the same VCFS through the older CreateSomaticPanelOfNormals runs in ~5 minutes, but the PON.vcf doesn’t contain the same information. One hopes that this is worth it?

From bhanuGandham on 2019-07-18

Hi @shenglai

> it seems max-mnp-distance should be added to the Mutect2 tumor only calling. I just wanna confirm if I understand it correctly with you

You are right and thank you for bringing it to our notice. We have corrected this error.

From dcampo on 2019-07-20

@alanhoyle

So I tried re-running with the —merge-input-intervals option, and it ran in literally 1 min…

From weeks to 1 min!!!

That option should really be emphasized in the tool description, or maybe even included by default.

Thanks a lot!

From syer89 on 2019-07-26

Hi @davidben and others, wondering if there a limitation on indel calling with Mutect2. Although it picked some validated INDELS (15bp del, 33bp ins, etc..), I have an 88bp deletion it did not. It was previously picked by HC (v3.2-) at ~14% and with v4.1.2 as well at ~9.75%. I understand there will be differences between the two callers, but it is real and not present in PON to be filtered out. Just want to make sure if this is an expected outcome (like the indel length etc..)

From davidben on 2019-08-08

@syer89 indels of that size are towards the limit of what Mutect2 can handle. There’s no fixed limit, just that our default assembly parameters start missing things around that size.

From ShishiMin on 2019-08-29

Hi,@bhanuGandham

gatk Mutect2 -R ref.fasta \

-L intervals.interval_list \

-I tumor.bam \

-germline-resource af-only-gnomad.vcf \

-pon panel_of_normals.vcf \

—f1r2-tar-gz f1r2.tar.gz \

-O unfiltered.vcf

This new command line is for tumor-only mode, But i want know the command for tumor and matched normal mode. Is the following command right? Let me know if I’m wrong. Thank you! What’more I am confused about that FilterByOrientationBias is only needs to be run on the tumor sample(s). Do I still need —f1r2-tar-gz argument in tumor and matched normal mode?

gatk Mutect2 \ -R reference.fa \ -I tumor.bam \ -I normal.bam \ -normal normal_sample_name \ —germline-resource af-only-gnomad.vcf.gz \ -pon panel_of_normals.vcf \ —f1r2-tar-gz f1r2.tar.gz \ -O unfiltered.vcf

From davidben on 2019-09-03

@ShishiMin your tumor-normal mode command line is correct. This generates raw F1R2 count data for just the tumor sample, which you then process in `LearnReadOrientationModel`, the output of which goes to `FilterMutectCalls`.

From syer89 on 2019-09-12

Hi @davidben , while using GATK4 (v1.4.2.0, mutect2/tumour mode) and testing for reproducibility it seems to vary in terms of #no of variants and annotations. Is that observed and any fixed seed option to use?

From davidben on 2019-09-12

@syer89 There’s no fixed seed yet but that’s a really good idea. I’ll try to do that soon.

From syer89 on 2019-09-19

Thank you @davidben. Any estimate of when the fix will be available? Also, is there documentation to better understand the filteringStats.tsv generated by FilterMutect calls?

From baltimoreoriole on 2019-09-25

Hi, I spent some good amount of time to understand GATK tools including Mutect2. I feel this is a maze of blogs, deprecated pages and variety of options. Could you please help me focus on the tools and steps required. Following is my situation:

1. I have 100 tumor samples with no normal germline samples.

2. Sequencing was done for 4 genes using libraries from Fluidigm followed by Illumina sequencer in a paired end mode.

3. I aligned reads -> SAM -> BAM -> picard fixmate -> picard reorder sam -> GATK base recalibration -> MuTect2 (GATK3)

java -jar $GATK -T MuTect2 \

-R ucsc.hg19.fasta \

-I:tumor Tumor1.bam \

—dbsnp hg19-All_chr.vcf \

—cosmic b37_cosmic_v54_120711_chr.vcf \

-L MyTargetGenes.bed \

-o Tumor1.vcf

Is there a step-wise guide for Tumors samples only GATK4 Mutect2. Also it is confusing how to generate some of the input files that are described here. For example –

-germline-resource af-only-gnomad.vcf \ -pon panel_of_normals.vcf \ —f1r2-tar-gz f1r2.tar.gz \

Where can I find af-only-gnmoad.vcf? Where is ‘panel of normals’ pon panel_of_normals.vcf.

I checked at the following FTP site. From the wide array of choices I am lost and I cannot find which one to use as panel_of_normals.vcf. (ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ )

Could you please help me find a workflow that is best for identifying mutations in each patient tumor sample (100 samples) and with no matching germline normals.

Thank you so much and appreciate your time helping me.

Regards,

Adrian.

From davidben on 2019-09-25

@syer89 I have what I think is a fix but I’m having trouble reproducing the error. Could you post a few examples of sites that change? Do you have a sense of which annotations change? Also, could you post the filtering stats from two runs of the same pipeline that gave different outputs?

There is currently no documentation for the filtering stats, but, briefly, it has these three parts:

1) estimated somatic mutation rates (natural log scale) for SNVs and indels of different lengths

2) summary of the beta mixture model of allele fraction clustering

3) estimated false positive and false negative rate attributable to each filter

We may document it more in the future but it’s a new feature that for now we intend mainly for our own debugging.

From davidben on 2019-09-25

@baltimoreoriole The information here: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_mutect_Mutect2.php is up-to-date and includes tumor-only calling.

Additionally, our developer docs: https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf are always up-to-date and contain command lines. A complete pipeline would include the Mutect2 command on page 1, (optionally) the LearnReadOrientationModel command on page 8, the GetPileupSummaries and CalculateContamination commands on page 10, and the FilterMutectCalls command on page 3.

You do not need to make you own panel of normals (unless you have a huge number of samples, it may even be counterproductive than our generic public panel). Instead you may use gs://gatk-best-practices/somatic-b37/Mutect2-exome-panel.vcf.

For the -germline-resource you should use gs://gatk-best-practices/somatic-b37/af-only-gnomad.raw.sites.vcf.

For the -V and -L arguments to GetPileupSummaries you may use gs://gatk-best-practices/somatic-b37/small_exac_common_3.vcf.

Finally, —f1r2-tar-gz is an output, not an input, of Mutect2.

From baltimoreoriole on 2019-09-25

Hi David, Thanks for your help in pointing me to correct site and resource.

While running gatk ver4 Mutect2 I get this error:

gatk --java-options "-Xmx24G" Mutect2 \ -R ucsc.hg19.fasta \ -L Mygenes.bed \ -I tumor.bam \ -germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz \ -pon Mutect2-exome-panel.vcf \ --f1r2-tar-gz f1r2.tar.gz \ -O unfiltered.vcf

also tried: gatk Mutect2 \ -R ucsc.hg19.fasta \ -L Mygenes.bed \ -I tumor.bam \ -germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz \ -pon Mutect2-exome-panel.vcf \ --f1r2-tar-gz f1r2.tar.gz \ -O unfiltered.vcf

I get an error that f1r2-tar-gz is not a recognized option

What could be wrong.

Thanks Adrian

USAGE: Mutect2 [arguments]

Call somatic SNVs and indels via local assembly of haplotypes Version:4.1.0.0

Required Arguments: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx (xxx -means printing all the options which I am not pasting here)..

A USER ERROR has occurred: f1r2-tar-gz is not a recognized option

Set the system property GATKSTACKTRACEONUSEREXCEPTION (--java-options '-DGATKSTACKTRACEONUSEREXCEPTION=true') to print the stack trace.

From davidben on 2019-09-26

@baltimoreoriole You will need to use a more recent version than 4.1. The current version, 4.1.3.0, is significantly more advanced.

From 2904359495 on 2019-09-27

when will 4.1.4.0 publish, because contamination not good for small panel, thanks a lot

From syer89 on 2019-09-30

@davidben, my observation is that 70% of the variant differences are related to “panel of normals”. Below is one sample rerun example (bwa-mem + markdup is the starting bam) using GATK_v4.1.2.0. I also attached the filtering stats files here.

1st re-run:

(FN variants – basically not found in the 2nd re-run)

4 1961206 . GCC TGA . base_qual;clustered_events;panel_of_normals;strand_bias CONTQ=18;DP=445;ECNT=5;GERMQ=93;MBQ=32,13;MFRL=161,97;MMQ=60,60;MPOS=22;PON;POPAF=7.3 ;ROQ=93;SEQQ=16;STRANDQ=1;TLOD=6.66 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:438,7:0.017:445:144,1:230,0:368,70,0,7

19 11136038 . A C . base_qual;clustered_events;panel_of_normals;weak_evidence CONTQ=89;DP=162;ECNT=6;GERMQ=93;MBQ=28,12;MFRL=186,187;MMQ=60,60;MPOS

=13;PON;POPAF=7.3;ROQ=15;SEQQ=1;STRANDQ=4;TLOD=3.3 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:133,10:0.044:143:51,3:27,1:128,5,10,0

2nd re-run:

(FP variant – occurred in this re-run)

14 106324729 . G C . base_qual;panel_of_normals CONTQ=93;DP=83;ECNT=2;GERMQ=53;MBQ=20,15;MFRL=129,229;MMQ=60,60;MPOS=12;PON;POPAF=7.30;ROQ=26;SEQQ=17;STRANDQ=12;TLOD=6.92 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:64,14:0.149:78:13,5:21,1:36,28,2,12

In terms of annotations, filters like CONTQ, GERMQ, T_LOD values (and more?) vary between the reruns which I guess in turn will affect the filter flags being annotated. For example, the same variant in different re-runs -

1st re-run:

14 106146495 . T TG . haplotypeCONTQ=93;DP=68;ECNT=2;GERMQ=5;MBQ=32,32;MFRL=209,189;MMQ=60,60;MPOS=20;POPAF=7.3;ROQ=93;RPA=1,2;RU=G;SEQQ=93;STR;STRANDQ=9;STRQ=93;TLOD=72.8 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:37,21:0.366:58:15,9:17,10:0|1:106146495_T_TG:106146495:2,35,0,21

19 1612476 . T G . PASSCONTQ=93;DP=174;ECNT=2;GERMQ=93;MBQ=31,22;MFRL=174,195;MMQ=60,60;MPOS=10;POPAF=7.3;ROQ=37;SEQQ=62;STRANDQ=3;TLOD=11.08 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:151,13:0.065:164:52,3:62,4:53,98,1,12

2nd re-run:

14 106146495 . T TG . germline;haplotypeCONTQ=93;DP=68;ECNT=2;GERMQ=3;MBQ=32,32;MFRL=209,189;MMQ=60,60;MPOS=20;POPAF=7.3;ROQ=93;RPA=1,2;RU=G;SEQQ=93;STR;STRANDQ=9;STRQ=93;TLOD=72.8 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:37,21:0.366:58:15,9:17,10:0|1:106146495_T_TG:106146495:2,35,0,21

19 1612476 . T G . strand_biasCONTQ=93;DP=174;ECNT=2;GERMQ=93;MBQ=31,22;MFRL=174,195;MMQ=60,60;MPOS=10;POPAF=7.3;ROQ=37;SEQQ=62;STRANDQ=3;TLOD=11.08 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:151,13:0.065:164:52,3:62,4:53,98,1,12

I hope this is helpful and pls let me know if you need any additional information from me.

GATK_upload.zip

From ShishiMin on 2019-10-03

Hi,@davidben

where can i download this file gs://gatk-best-practices/somatic-b37/Mutect2-exome-panel.vcf.?

From ShishiMin on 2019-10-03

@davidben I found it. Thank you.

From jamesp on 2019-11-10

Where does Mutect2 put the .stats file? I can’t find it to delocalize for next steps like filtering. I expected it to be at (in my example) unfiltered.vcf.stats, but that file isn’t produced. (Running this with the [Dockerized GATK](us.gcr.io/broad-gatk/gatk:4.1.4.0).)

Command line:

```

/gatk/gatk —java-options “-Xmx${RAM}G” \ Mutect2 \ —input ${cram} \ —reference ${REFERENCE_FASTA} \ —output unfiltered.vcf

```

GATK library info:

```

10:06:22.798 INFO NativeLibraryLoader – Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so

10:06:24.107 INFO Mutect2 – ——————————————————————————————

10:06:24.108 INFO Mutect2 – The Genome Analysis Toolkit (GATK) v4.1.4.0

10:06:24.108 INFO Mutect2 – For support and documentation go to https://software.broadinstitute.org/gatk/

10:06:24.109 INFO Mutect2 – Executing as root@ebc00af66c31 on Linux v4.19.80+ amd64

10:06:24.114 INFO Mutect2 – Java runtime: OpenJDK 64-Bit Server VM v1.8.0_212-8u212-b03-0ubuntu1.16.04.1-b03

10:06:24.114 INFO Mutect2 – Start Date/Time: November 9, 2019 10:06:22 AM UTC

10:06:24.114 INFO Mutect2 – ——————————————————————————————

10:06:24.115 INFO Mutect2 – ——————————————————————————————

10:06:24.115 INFO Mutect2 – HTSJDK Version: 2.20.3

10:06:24.116 INFO Mutect2 – Picard Version: 2.21.1

10:06:24.116 INFO Mutect2 – HTSJDK Defaults.COMPRESSION_LEVEL : 2

10:06:24.116 INFO Mutect2 – HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false

10:06:24.116 INFO Mutect2 – HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true

10:06:24.116 INFO Mutect2 – HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false

…

12:16:20.925 INFO ProgressMeter – Un_KN707838v1_decoy:901 129.9 10752010 82766.9

12:16:30.929 INFO ProgressMeter – Un_KN707896v1_decoy:301 130.1 10752790 82666.8

12:16:40.934 INFO ProgressMeter – Un_KN707906v1_decoy:996 130.2 10753100 82563.4

12:16:51.217 INFO ProgressMeter – Un_KN707951v1_decoy:1060 130.4 10753580 82458.5

12:17:01.258 INFO ProgressMeter – Un_JTFH01000012v1_decoy:1801 130.6 10754880 82362.8

Using GATK jar /gatk/gatk-package-4.1.4.0-local.jar

```

But then after the run completes, only the unfiltered.vcf file is present, no unfiltered.vcf.stats file is seen:

``` #List files in current path:

/bin/ls

./unfiltered.vcf

```

From kchang3 on 2019-11-21

1. Regarding the PON workflow above, is it still outputting germline variants as default?

2. Does filterMutect step uses PON as a hard filter in version 4.1.4.0, such that if a variant is seen PON then it’s excluded ?

From deena_b on 2019-11-23

Would someone please post a link to the doc page for MergeMutectStats?

Thank you :smile:

From davidben on 2019-11-25

@jamesp Does the output show M2 failing somewhere? That could cause it to produce a VCF but not a stats file.

From davidben on 2019-11-25

@kchang3 If you do not give `CreateSomaticPanelOfNormal` a `—germline-resource` input then the PoN will contain germline variants.

`FilterMutectCalls` uses the PoN as a hard filter.

From kchang3 on 2019-11-26

@davidben I’m running gatk-4.1.4.0 Mutect2 on mice wes tumor and matched normal data with pon as described above. I noticed a few tumor-normal pairs have > 90% mutations filtered due to normal artifact. Could the —normal-p-value-threshold be too stringent for these pairs? How should I go about understanding why these samples have such high normal artifact filtered mutations?

Report abuse