created by Geraldine_VdAuwera
on 2017-05-21
The current recalibration paradigm evaluates each position, and passes or filters all alleles at that position, regardless of how many alternate alleles occur there. This has major disadvantages in cases where a real variant allele occurs at the same position as an error that has sufficient evidence to be called as a variant. The goal of the Allele-Specific Filtering Workflow is to treat each allele separately in the annotation, recalibration and filtering phases.
Multi-allelic sites benefit the most from the Allele-Specific Filtering Workflow because each allele will be evaluated more accurately than if its data was lumped together with other alleles. Large callsets will benefit more than small callsets because multi-allelics will occur more frequently as the number of samples in a cohort increases. One callset with 42 samples that was used for development contains 3% multi-allelic sites, while the ExAC callset [http://biorxiv.org/content/early/2015/10/30/030338] with approximately 60,000 samples contains nearly 8% multi-allelic sites. Recalibrating each allele separately will also greatly benefit rare disease studies, in which rare alleles may not be shared by other members of the callset, but could still occur at the same positions as common alleles or errors.
No additional resource files are necessary, but this workflow does require the sample bam files. Annotations cannot be calculated from VCF or gVCF files alone.
After running the Allele-Specific Filtering Workflow, several new annotations will be added to the INFO field for your variants (see below), and VQSR results will be based on those new annotations, though using SNP and INDEL tranche sensitivity cutoffs equivalent to the non-allele-specific best practices. If after analyzing your recalibrated data, you’re not convinced that this workflow is for you, you can still run the classic VQSR on your genotyped VCF because the standard annotations for VQSR are still included in the genotyped VCF.
Nope. Sorry. The way we generate and combine the allele-specific data depends on having raw data for each sample in the gVCF.
Not yet. Although we are happy with the performance of this workflow, our own production pipelines have not yet been updated to include this, so it should still be considered experimental. However, we do encourage you to try this out on your own data and let us know what you find, as this helps us refine the tools and catch bugs.
Begin with a post-BQSR bam file for each sample. The read data in the bam are necessary to generate the allele-specific annotations.
Using the locally-realigned reads, HaplotypeCaller will generate gVCFs with all of its usual standard annotations, plus raw data to calculate allele-specific versions of the standard annotations. That means each alternate allele in each VariantContext will get its own data used by downstream tools to generate allele-specific QualByDepth, RMSMappingQuality, FisherStrand and allele-specific versions of the other standard annotations. For example, this will help us sort out good alleles that only occur in a few samples and have a good balance of forward and reverse reads but occur at the same position as another allele that has bad strand bias because it’s probably a mapping error.
java -jar $GATKjar -T HaplotypeCaller -R $reference \ -I mySample.bam \ -o mySample.AS.g.vcf \ -ERC GVCF \ -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation
Here the allele-specific data for each sample is combined per allele, but is not yet in its final annotation form. We only do this for computational efficiency reasons when we have >200 samples.
java -jar $GATKjar -T CombineGVCFs -R $reference \ -V mySample1.AS.g.vcf -V mySample2.AS.g.vcf -V mySample3.AS.g.vcf \ -o allSamples.g.AS.vcf \ -G StandardAnnotation -G AS_StandardAnnotation
Note that if you run this, you need to modify the -V
input in the next step to just the combined gVCF file.
Raw allele-specific data for each sample is used to calculate the finalized annotation values. In GATK 3.6, non-allele-specific rank sum annotations are still combined using the median across all samples. (See below for details on more accurate MQ calculations.)
java -jar $GATKjar -T GenotypeGVCFs -R $reference \ -V mySample1.AS.g.vcf -V mySample2.AS.g.vcf -V mySample3.AS.g.vcf \ -o allSamples.AS.vcf \ -G StandardAnnotation -G AS_StandardAnnotation
In allele-specific mode, the VariantRecalibrator builds the statistical model based on data for each allele, rather than each site. This has the added benefit of being able to recalibrate the SNPs in mixed sites according to the appropriate model, rather than lumping them in with INDELs as had been done previously. It will also provide better results by matching the exact allele in the training and truth data rather than just the position.
````
java -jar $GATKjar -T VariantRecalibrator -R $reference -mode SNP -AS \ -an ASQD -an ASFS -an ASReadPosRankSum -an ASMQ -an ASMQRankSum -an ASSOR \ -input allSamples.AS.vcf \ -resource:known=false,training=true,truth=true,prior=15.0 $hapmapsites \ -resource:known=false,training=true,truth=true,prior=12.0 $omnisites \ -resource:known=false,training=true,truth=false,prior=10.0 $training1000Gsites \ -resource:known=true,training=false,truth=false,prior=2.0 $dbSNP_129 \ -tranche 100.0 -tranche 99.9 -tranche 99.8 -tranche 99.7 -tranche 99.5 -tranche 99.3 -tranche 99.0 -tranche 90.0 \ -recalFile allSamples.AS.snps.recal \ -tranchesFile allSamples.AS.snps.tranches \ -modelFile allSamples.AS.snps.report \ -rscriptFile allSamples.AS.snps.R
java -jar $GATKjar -T VariantRecalibrator -R $reference -mode INDEL -AS \ -an ASQD -an ASFS -an ASReadPosRankSum -an ASMQRankSum -an ASSOR \ -input allSamples.AS.vcf \ -resource:known=false,training=true,truth=true,prior=12.0 $indelGoldStandardCallset \ -resource:known=true,training=false,truth=false,prior=2.0 $dbSNP129 \ -tranche 100.0 -tranche 99.9 -tranche 99.8 -tranche 99.7 -tranche 99.5 -tranche 99.3 -tranche 99.0 -tranche 90.0 \ --maxGaussians 4 \ -recalFile allSamples.AS.indels.recal \ -tranchesFile allSamples.AS.indels.tranches \ -modelFile allSamples.AS.indels.report \ -rscriptFile allSamples.AS.indels.R
````
Note that these commands are for exomes. For whole genomes, the classic -DP
annotation will still be used for SNP recalibration, as in the Best Practices.
Allele-specific filters are calculated and stored in the AS_FilterStatus INFO annotation. A site-level filter is applied to each site based on the most lenient filter across all alleles. For example, if any allele passes, the entire site passes. If no alleles pass, then the filter will be applied corresponding to the allele with the lowest tranche (best VQSLOD).
The two ApplyRecalibration modes should be run in series, as in our Best Practices recommendations. If SNP and INDEL ApplyRecalibration modes are run in parallel and combined with CombineVariants (which will work for the standard VQSR arguments), in allele-specific mode any mixed sites will fail to be processed correctly.
````
java -jar $GATKjar -T ApplyRecalibration -R $reference \ -input allSamples.AS.vcf \ -mode SNP --tsfilterlevel 99.70 -AS \ --recalfile allSamples.AS.snps.recal \ --tranchesfile allSamples.AS.snps.tranches \ -o allSamples.AS.snp_recalibrated.vcf
java -jar $GATKjar -T ApplyRecalibration -R $reference \ -input allSamples.AS.snprecalibrated.vcf \ -mode INDEL --tsfilterlevel 99.3 -AS \ --recalfile allSamples.AS.indels.recal \ --tranchesfile allSamples.AS.indels.tranches \ -o allSamples.AS.snpindel_recalibrated.vcf
````
The Allele-Specific Filtration Workflow adds new allele-specific info-level annotations to the VCFs and produces a final output with allele-specific filters based on the VQSR SNP and INDEL tranches.
The ASStandard annotation set will produce allele-specific versions of our standard annotations. For ASMQ, this means that the root-mean-squared mapping quality will be given for all of the reads that support each allele, respectively. For rank sum and strand bias tests, the annotation for each allele will compare that alternative allele’s values against the reference allele.
Each allele will be described in a separate line in the output recalibration (.recal) files. For the advanced analyst, this is a good way to check which allele has the worst data and is responsible for a NEGATIVETRAINSITE classification.
After both ApplyRecalibration modes are run, the INFO field will contain an annotation called ASFilterStatus, which will list the filter corresponding to each alternate allele. Allele-specific culprit and VQSLOD scores will also be added to the final VCF in the ASculprit and AS_VQSLOD annotations, respectively.
3 195507036 . C G,CCT 6672.42 VQSRTrancheINDEL99.80to99.90 AC=7,2;AF=0.106,0.030;AN=66;ASBaseQRankSum=-0.144,1.554;ASFS=127.421,52.461;ASFilterStatus=VQSRTrancheSNP99.90to100.00,VQSRTrancheINDEL99.80to99.90;ASMQ=29.70,28.99;ASMQRankSum=1.094,0.045;ASReadPosRankSum=1.120,-7.743;ASSOR=9.981,7.523;ASVQSLOD=-48.3935,-7.8306;ASculprit=ASFS,ASFS;BaseQRankSum=0.028;DP=2137;ExcessHet=1.6952;FS=145.982;GQMEAN=200.21;GQSTDDEV=247.32;InbreedingCoeff=0.0744;MLEAC=7,2;MLEAF=0.106,0.030;MQ=29.93;MQRankSum=0.860;NCC=9;NEGATIVETRAIN_SITE;QD=10.94;ReadPosRankSum=-7.820e-01;SOR=10.484
3 153842181 . CT TT,CTTTT,CTTTTTTTTTT,C 4392.82 PASS AC=15,1,1,1;AF=0.192,0.013,0.013,0.013;AN=78;ASBaseQRankSum=-11.667,-3.884,-2.223,0.972;ASFS=204.035,22.282,16.930,2.406;ASFilterStatus=VQSRTrancheSNP99.90to100.00,VQSRTrancheINDEL99.50to99.70,VQSRTrancheINDEL99.70to99.80,PASS;ASMQ=58.44,59.93,54.79,59.72;ASMQRankSum=2.753,0.123,0.157,0.744;ASReadPosRankSum=-9.318,-5.429,-5.578,1.336;ASSOR=6.924,3.473,5.131,1.399;ASVQSLOD=-79.9547,-2.0208,-3.4051,0.7975;ASculprit=ASFS,ASReadPosRankSum,ASReadPosRankSum,QD;BaseQRankSum=-2.828e+00;DP=1725;ExcessHet=26.1737;FS=168.440;GQMEAN=117.51;GQSTDDEV=141.53;InbreedingCoeff=-0.1776;MLEAC=16,1,1,1;MLEAF=0.205,0.013,0.013,0.013;MQ=54.35;MQRankSum=0.967;NCC=3;NEGATIVETRAINSITE;QD=4.42;ReadPosRankSum=-2.515e+00;SOR=4.740
Since GATK3.4, GenotypeGVCFs has had the ability to output a “spanning deletion allele” (now represented with *) to indicate that a position in the VCF is contained within an upstream deletion and may have “missing data” in samples that contain that deletion. While the upstream deletions will continue to be recalibrated and filtered by VQSR similar to the way they always have been, these spanning deletion alleles that occur downstream (and represent the same event) will be skipped.
Using the default gVCF bands ([1:60,70,80,90,99]), the raw allele-specific data makes a minimal size increase, which was less than a 1% increase on the NA12878 exome used for development.
If you ran the same callset through GATK 3.4 or earlier and GATK 3.5 or later, you may notice that the MQ annotation values for your variants changed slightly. That’s because with or without allele-specific annotation and filtering, MQ is being calculated in a new more accurate way. The GenotypeGVCFs used to combine each sample’s annotations by taking the median, but new MQ annotation calculation code now combine’s each sample’s data in a more mathematically correct way.
Problem: WARN 08:35:26,273 ReferenceConfidenceVariantContextMerger - WARNING: remaining (non-reducible) annotations are assumed to be ints or doubles or booleans, but 0.00|15005.00|14400.00|0.00 doesn't parse and will not be annotated in the final VC.
Solution: Remember to add -G Standard -G AS_Standard
to the GenotypeGVCFs command
Problem: Standard (non-allele-specific) annotations are missing
Solution: HaplotypeCaller and GenotypeGVCFs need -G Standard specified if -G AS_Standard is also specified.
From bgrenier on 2017-10-13
Hi,
I wonder why the allele specific annotation AS_InbreedingCoeff is not used during the recalibration of SNPs and Indels. Is this a mistake or is there any reason not using this for the allele specific VQSR (because it is still part of the Best Practices) ?
From Sheila on 2017-10-17
@bgrenier
Hi,
I think we have not included it yet in the Best Practices. This may happen in GATK4. Have a look at [this doc](https://software.broadinstitute.org/gatk/documentation/article?id=9622).
-Sheila
From bgrenier on 2017-10-17
@Sheila
I don’t understand : the link you provide is the same page I made this comment about…
My question is : the Best practices of VQSR without the allele specific annotations include InbreedingCoeff in its annotations while the allele specific VQSR does not include AS_InbreedingCoeff and I wonder why?
From Sheila on 2017-10-17
@bgrenier
Hi,
Ah, I thought you were asking if AS annotations are Best Practice.
I am thinking the AS_InbreedingCoeff was left out by accident in this document. You can use it in VQSR (as long as you have at least 10 samples with each alternate allele). I will make a note to add this in the GATK4 docs.
Thanks,
Sheila
From bgrenier on 2017-10-18
@Sheila
Hi,
Thanks for the information, that’s what I thought.
Best,
From gauthier on 2018-01-03
@bgrenier We actually recommend using a hard filter based on heterozygosity before VQSR because we found that VQSR wasn’t strict enough on sites with lots of excess heterozygosity. Currently we’re using the (relatively new) ExcessHet annotation because it’s more robust to small cohort sizes than InbreedingCoeff. Our latest and greatest best practices filter out variants with ExcessHet > 54.69 (We want a cutoff of anything more extreme than a z-score of -4.5 which is a p-value of 3.4e-06, which phred-scaled is 54.69). If your data doesn’t have ExcessHet and has enough samples to calculate InbreedingCoeff, we recommend filtering variants with InbreedingCoeff < -0.3.
These are based on the non-allele-specific versions of the annotations because we’ve found that cutoffs based on the allele-specific versions were too strict.
As part of the upcoming GATK4 release, we’ll be updating the posted best practices and you should see the new filtering workflow there soon.
From init_js on 2019-05-11
Does GenomicsDBImport import allele-specific annotations by default? In 4.1.1.0, GenomicsDBImport does not accept `-G XXX` options, nor does it seem to document any option related to allele-specific annotations. I read that the format supports storing the annotations (https://software.broadinstitute.org/gatk/documentation/article?id=11813) , but I don’t see how to to toggle the behavior.
Can you please confirm if GenomicsDBImport tallies up the annotations correctly in its output?
Perhaps this is a red herring or a spurious warning, but when I do `HaplotypeCaller -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation …` (on each sample) followed by `GenomicsDBImport …`, and finally `GenotypeGVCFs -G StandardAnnotation -G AS_StandardAnnotation …`, GenotypeGVCFs warns about AS_ annotations and a few others:
```
WARNING: No valid combination operation found for INFO field AS_InbreedingCoeff – the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field AS_QD – the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DS – the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff – the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC – the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF – the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field AS_InbreedingCoeff – the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field AS_QD – the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DS – the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff – the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC – the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF – the field will NOT be part of INFO fields in the generated VCF records
```
From gauthier on 2019-05-13
GenomicsDB does import allele-specific annotations by default.
The warnings you’re seeing are for annotations that are recalculated after genotyping, and thus don’t need to be merged. For example, only after genotyping with GenotypeGVCFs do we have the maximum likelihood allele count (MLEAC).