howto Apply hard filters to a call set

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2013-06-17

Objective

Apply hard filters to a variant callset that is too small for VQSR or for which truth/training sets are not available.

Caveat

This document is intended to illustrate how to compose and run the commands involved in applying the hard filtering method. The annotations and values used may not reflect the most recent recommendations. Be sure to read the documentation about why you would use hard filters and how to understand and improve upon the generic hard filtering recommendations that we provide.

Steps

Extract the SNPs from the call set
Determine parameters for filtering SNPs
Apply the filter to the SNP call set
Extract the Indels from the call set
Determine parameters for filtering indels
Apply the filter to the Indel call set

1. Extract the SNPs from the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ -T SelectVariants \ -R reference.fa \ -V raw_variants.vcf \ -selectType SNP \ -o raw_snps.vcf

Expected Result

This creates a VCF file called raw_snps.vcf, containing just the SNPs from the original file of raw variants.

2. Determine parameters for filtering SNPs

SNPs matching any of these conditions will be considered bad and filtered out, i.e. marked with a filter name (which you specify in the filtering command) in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the SNP using the culprit annotation. SNPs that do not match any of these conditions will be considered good and marked PASS in the output VCF file.

QualByDepth (QD) 2.0

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

FisherStrand (FS) 60.0

Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

RMSMappingQuality (MQ) 40.0

This is the Root Mean Square of the mapping quality of the reads across all samples.

MappingQualityRankSumTest (MQRankSum) -12.5

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with ref bases vs. those with the alternate allele). Note that the mapping quality rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

ReadPosRankSumTest (ReadPosRankSum) -8.0

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

StrandOddsRatio (SOR) 3.0

The StrandOddsRatio annotation is one of several methods that aims to evaluate whether there is strand bias in the data. Higher values indicate more strand bias.

3. Apply the filter to the SNP call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ -T VariantFiltration \ -R reference.fa \ -V raw_snps.vcf \ --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \ --filterName "my_snp_filter" \ -o filtered_snps.vcf

Expected Result

This creates a VCF file called filtered_snps.vcf, containing all the original SNPs from the raw_snps.vcf file, but now the SNPs are annotated with either PASS or my_snp_filter depending on whether or not they passed the filters.

For SNPs that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each SNP failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.

4. Extract the Indels from the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ -T SelectVariants \ -R reference.fa \ -V raw_HC_variants.vcf \ -selectType INDEL \ -o raw_indels.vcf

Expected Result

This creates a VCF file called raw_indels.vcf, containing just the Indels from the original file of raw variants.

5. Determine parameters for filtering Indels.

Indels matching any of these conditions will be considered bad and filtered out, i.e. marked with a filter name in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the indel using the culprit annotation. Indels that do not match any of these conditions will be considered good and marked PASS in the output VCF file.

QualByDepth (QD) 2.0

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

FisherStrand (FS) 200.0

ReadPosRankSumTest (ReadPosRankSum) 20.0

StrandOddsRatio (SOR) 10.0

The StrandOddsRatio annotation is one of several methods that aims to evaluate whether there is strand bias in the data. Higher values indicate more strand bias.

6. Apply the filter to the Indel call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ -T VariantFiltration \ -R reference.fa \ -V raw_indels.vcf \ --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \ --filterName "my_indel_filter" \ -o filtered_indels.vcf

Expected Result

This creates a VCF file called filtered_indels.vcf, containing all the original Indels from the raw_indels.vcf file, but now the Indels are annotated with either PASS or my_indel_filter depending on whether or not they passed the filters.

For Indels that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each Indel failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.

Updated on 2018-11-27

From jfb on 2013-06-29

Is there a clean way to avoid all the warning messages that arise when there are no alternate reads and one is using ReadPosRankSum or MQRankSum to filter variants? In a Dec2012 comment to kurt (http://gatkforums.broadinstitute.org/discussion/1996/selectvariants-mqranksum-or-readposranksum), ebanks suggested a work around using “isHomVar == 1 && …”, but I couldn’t get that to work. Maybe another (better?) way would be to require more than two non-zero entries in BaseCounts—but I couldn’t get that to work either. any suggestions?

From jfb on 2013-06-29

From volavii on 2013-07-05

maybe i get this wrong, but why should i filter by haplotypeScore? It describes the consistency of a haplotype, so its more a filter for heterozygous sites. So when i would have in the reference A and the alternative allele is G, and the genotype is 1/1, but the haplotypescore is high (>13), this SNP would be filtered out, right?

From ebanks on 2013-07-08

The Haplotype Score describes the likelihood that the reads at the site come from at most two haplotypes. So, yes, it does make sense even in the homozygous case.

From mjmaurer on 2013-10-07

Hello,

I am unclear whether it is still recommended (since the new best practices guidelines for GATK 2.7) to use a max DP filter when performing hard filtering on variants called from whole genome data. Does the following still apply?

“The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that for exomes, a straight DP filter shouldn’t be used because the relationship between misalignments and depth isn’t clear for capture data.”

Also, it is ever recommended to use a minimum depth of coverage filter?

Thanks,

Matt

From Geraldine_VdAuwera on 2013-10-11

Hi Matt,

Sorry for the delayed response, your question slipped through my net.

Yes, that remark still applies, as indicated in [the document on hard filtering](http://www.broadinstitute.org/gatk/guide/article?id=3225). We don’t normally recommend filtering on depth of coverage (see our default recommendations in the document linked above), but if you do you should keep those caveats in mind.

From mmterpstra on 2013-10-29

@jfb / other people that think that the VariantSelection/Filteringtools tools spit out too many errors:

the errors only give error on missing values an do not mean anything. Although if you want to troubleshoot you commandline you can tap into the underlying variantcontext in your jexl expression to do this(and many other things). Here are in some options I specified with the VariantFiltration Tool: —filterExpression “vc.hasAttribute(‘ReadPosRankSum’) && ReadPosRankSum < -20.0” \ —filterName “HRUNgt5” \ —filterExpression “vc.hasAttribute(‘RPA’) && RPA > 8” \ —filterName “RPAgt8” \

The “vc.hasAttribute(‘attribute’)” part returns true if variable exists and false if the variable does not exists. When false the second statement is not evaluated. Good luck at troubleshoot!!!

@everyone

Also my question: What is the easiest way to select String variables (without using variantcontext ) I now use:

“(vc.hasAttribute(‘SNPEFF_IMPACT’) && vc.getAttribute(‘SNPEFF_IMPACT’).equals(‘HIGH’))“

This thing woks, but completely unfriendly to beginners….(Took a day or two to figure this one out….)

From Geraldine_VdAuwera on 2013-10-29

Hi @mmterpstra, we realize this is a little rough, but we’re planning to put together a cookbook of JEXL expressions to help new users get over that initial hump.

From Bianca_T on 2013-11-21

Hi,

I am working with a dataset from Ion Torrent (Ampliseq targeted resequencing on human samples). I have called the variants with HC (GATK 2.7.2) and would like now to apply hard filters. Seen the different kind of sequencing errors produced by Ion Torrent, how do you suggest I modify the filters? Applied as they are now, a huge number of indels are filtered out but those include my control known mutations.

Thanks for your help.

/Bianca

From ebanks on 2013-11-21

Hi Bianca, we don’t have much experience trying to get good calls out of Ion data. But I can say that from the little we’ve seen, we’re doubtful that it’s at all possible to call indels accurately from that data. If you do have success please post your findings here as we’d all love to know how it’s done!

From bwubb on 2014-02-20

Would it be recommended to apply any hard filters to MULTIALLELIC_COMPLEX variants? (And is that GATK long form of MNP?). If so would it be treated more like an indel? Thank you.

From Geraldine_VdAuwera on 2014-02-20

Hi @bwubb,

Can you tell me where you encountered this name? I’m not familiar with it and I can’t find it used anywhere in our codebase. The types of variants we work with are listed here:

http://www.broadinstitute.org/gatk/guide/article?id=3682

From bwubb on 2014-02-21

Hi @Geraldine_VdAuwera

Well this is in a vcf called by UG. I could provide the the settings I used, but the are more or less in line with the Updated Best Practices for 2.8-1, though the term is appears in the VariantType annotation. Here is an example (Sorry the post formatting gets weird):

1 45787220 . CT CTT,C 17247.46 PASS AC=20,6;AF=0.029,8.850e-03;AN=678;BaseQRankSum=-1.230;DP=59253;FS=11.224; InbreedingCoeff=-0.0093;LowMQ=0.0052,0.0093,59253;MLEAC=19,3;MLEAF=0.028,4.425e-03;MQ=59.43;MQ0=0;MQRankSum=-0.865; QD=4.76;RPA=12,13,11;RU=T;ReadPosRankSum=2.225;STR;VariantType=MULTIALLELIC_COMPLEX.Other GT:AD:DP:GQ:PL

My guess would be Indel, and while this particular variant “PASSED”, now based on your comment Im not certain if the the filters are actually considered. Thank you.

From Geraldine_VdAuwera on 2014-02-21

Ah, found it. “MULTIALLELIC_COMPLEX” refers to a multiallelic variant that is neither a simple deletion nor a simple insertion (so it’s a mix of both). I couldn’t find it in the code because the complete string doesn’t appear as such, it’s a composite of “MULTIALLELIC_” + insertions, deletion or complex base on case.

So yes, it’s a sort of multiallelic indel. If you used a complex filtering expression, you might consider re-running it through each filter individually to make sure that it’s getting evaluated properly on each property.

From yeaman on 2014-03-17

Hello,

I was wondering if you have any specific suggestions for what kind of filtering criteria we should use when creating a new dbsnp for a non-model organism? I was assuming I would just filter based on the site quality score, but now that seems like it might be a coarse approach. I assume at this stage it is good to be very stringent? Would it be appropriate to use the values described above in the documentation for step #2?

Thanks!

From Geraldine_VdAuwera on 2014-03-17

Hi @yeaman,

The values above should provide a good starting point (definitely better than just filtering on quals) but you’ll need to experiment a little to find the values that are right for your data, and to produce the tradeoff between sensitivity and specificity that is right for your project. Good luck!

From yeaman on 2014-03-17

Thanks for the quick reply! I’ll try that out.

From davela3 on 2014-05-01

Hi @Geraldine_VdAuwera,

Is there a reason why the examples have the —interval/-L argument set to “20”? I apologize if this question is a bit trivial, but what would determine whether we use the argument or not? In my case, I only used intervals during the ReAligner phase of the pipeline.

From Sheila on 2014-05-01

@davela3‌

Hi,

In the example, this means to look only at chromosome 20. The -L argument, although an “interval” argument, can also be used to specify a chromosome to focus on.

If running on a whole genome, you would not specify -L. If running on an exome, you can use it to provide a list of capture targets.

I hope this answers your question.

-Sheila

From davela3 on 2014-05-01

Thanks so much, @Sheila!

From davela3 on 2014-05-13

Hi,

For other SNP filtering parameters, should we also consider the GT in conjunction with the GQ (ie. only want 0/1 or 1/1 with a GQ or 20 or more)?

Similarly, are there are parameters we should consider for indels?

I used HaplotypeCaller on human samples (single gene target) – but I don’t seem to have HaplotypeScore in my filteredvcf files. Is this an assumed error on my part?

From Geraldine_VdAuwera on 2014-05-13

@davela3,

> should we also consider the GT in conjunction with the GQ (ie. only want 0/1 or 1/1 with a GQ or 20 or more)?

No, we do not make such a distinction, but feel free to experiment and see what works best in your data.

> are there are parameters we should consider for indels?

Just the ones in the document above. Same comment as for SNPs: it’s always a good idea to experiment and see fits what your dataset.

HaplotypeCaller does not emit HaplotypeScore because that metric is already taken into account during the calling process.

From davela3 on 2014-05-13

@Ger As usual, thank you for your response and insight!

From noushin6 on 2014-05-21

Hi @Geraldine_VdAuwera‌,

I am using GATK 3.1, trying to apply the hard filter described here. I noticed that I get slightly different results using “MappingQualityRankSumTest” vs “MQRankSum” in filter definition, with the latter being more strict of a filter. My guess would be that if anything, the walker is respecting “MQRankSum” and not the other, but that is in contrast with usage suggested here. Does it recognize both?

Thank you as always!

Noushin

From Sheila on 2014-05-22

@noushin‌

Hi Noushin,

Those are two different names for the same filter.

Please tell me what the command lines you used and what is the exact difference between the two results.

Thanks,

Sheila

From leoboki on 2014-07-15

Hi,

I understand that MQRankSum and ReadPosRankSum are z-approximation from Mann-Whitney Rank Sum Test. Extreme values of these statistics indicate strong biases for mapping qualities and read positions. Intuitively, to set up thresholds for hard filtering, the criteria should be two-tailed in these cases, that is, MappingQualityRankSum < -12.5 || MappingQualityRankSum > 12.5 || ReadPosRankSum < -8.0 || ReadPosRankSum > 8.0. However, in the example above, the filtering criteria include only “MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0.”

Does the expression “MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0” implicitly imply that these thresholds are two-tailed?

Thanks,

Leo

From Geraldine_VdAuwera on 2014-07-16

Hi @leoboki,

No, we only check whether the alt bases have lower scores than those of the ref bases.

From wendy on 2014-10-07

Hi all,

After I use hard filtering.I got two data(filtered_snps.vcf and filtered_indels.vcf ). Should I merge these two data for the following steps?

Thanks for help!

Wendy

From Sheila on 2014-10-07

@wendy‌

Hi Wendy,

Yes, you can use CombineVariants for this step.

-Sheila

From sletort on 2014-10-14

Hi,

Just to mention it, space in my filter make the command fails with ‘Invalid argument value’

I’m using bash.

From Zaag on 2014-10-17

There is a small error in step 5, should be indels i guess.

From Sheila on 2014-10-17

@Zaag‌

Fixed :)

Thanks,

Sheila

From metheuse on 2014-10-29

I don’t know why some variants were missing after I applied selectvariants to separate them into snps and indels. Like the following one:

`chr6 106965216 rs200196517 TACAC CACAC,T 12574.57 . AC=3,1;AF=0.375,0.125;AN=8;BaseQRankSum=2.23;ClippingRankSum=0.086;DB;DP=875;FS=0.820;GQ_MEAN=3385.00;GQ_STDDEV=1302.81;MLEAC=3,1;MLEAF=0.375,0.125;MQ=54.83;MQ0=0;MQRankSum=-8.372e+00;NCC=0;QD=15.30;ReadPosRankSum=-1.102e+00 GT:AD:DP:GQ:PL 0/1:127,120,0:247:99:5115,0,4980,5501,5375,10875 0/2:133,0,90:223:99:3404,3817,9566,0,5750,5473 0/1:120,97,0:217:99:3367,0,4846,3737,5138,8875 0/1:110,23,2:135:99:1789,0,4475,2096,4671,6758`

I used exactly the same commands listed on this page to select snps and indels. But the above variant is in neither of them.

I think this variant could be either a SNP (for the allele “CACAC”) or an INDEL (for the allele “T”). Is this why the program cannot decide its category so it discarded it?

It’s not clear to me how I should deal with this case. Any advice? Thanks.

From Kurt on 2014-10-29

That would be `-selectType MIXED` I believe.

From metheuse on 2014-10-30

> @Kurt said:

> That would be `-selectType MIXED` I believe.

Thanks. Sorry I didn’t notice that before. But what filtration I should do with these “mixed” variants, since SNP and INDEL have different filtration?

Does it make sense to rewrite this kind of position into two lines (one SNP, and one INDEL) and then apply two filtration separately?

From Geraldine_VdAuwera on 2014-11-03

We haven’t examined this extensively so it’s up to you, but FYI during VQSR the mixed type variants are processed with the indels. Best thing would be to try both approaches and compare results…

From sibscc on 2014-11-07

Can VariantFiltration just filter the variant by depth?

From Geraldine_VdAuwera on 2014-11-07

Sure, if you want to. We don’t recommend it though, at least not as a primary way of filtering your callset.

From sibscc on 2014-11-07

> @Geraldine_VdAuwera said:

> Sure, if you want to. We don’t recommend it though, at least not as a primary way of filtering your callset.

Thank you.

Is there a document including all the filters in “—filterExpression”?

From Geraldine_VdAuwera on 2014-11-07

You can compose any filter you want using the site-level annotations that are listed in the [Variant Annotations documentation](https://www.broadinstitute.org/gatk/guide/tooldocs/#VariantAnnotations).

From sibscc on 2014-11-12

@Geraldine_VdAuwera

Hi, I am wondering if I do not specify SNP or INDEL, it will filter both SNPs and INDELs.

From Geraldine_VdAuwera on 2014-11-12

@sibscc That’s correct, it will filter both. That’s why we recommend separating them into different files.

From Gangpao on 2015-03-29

Hi there! i meet an error message in the filters. commands as follows:

add an annotation

java -Xmx5g -jar $gatk -T VariantAnnotator -R $refdir/ucsc.hg19.fasta --variant $otherdir/test.recalibratedsnpsrawindels.vcf -I $bamdir/T-SZ-03-1.dedupped.realigned.recal.bam -o $otherdir/test.annotated.recalibratedsnpsrawindels.vcf -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -L $otherdir/test.recalibratedsnpsrawindels.vcf --dbsnp $refdir/dbsnp138.hg19.vcf

1. Extract the SNPs from the call set

java -jar $gatk -R $refdir/ucsc.hg19.fasta -T SelectVariants -V $otherdir/test.annotated.recalibratedsnpsrawindels.vcf -selectType SNP -o $otherdir/test.recalibrated_snps.vcf

2. Apply the filter to the SNP call set

java -jar $gatk -R $refdir/ucsc.hg19.fasta -T VariantFiltration -V $otherdir/test.recalibratedsnps.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "mysnpfilter" -o $otherdir/test.recalibrated.filtered_snps.vcf

the warn as below: WARN 22:02:16,983 Interpreter - ![63,84]: 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0;' undefined variable MappingQualityRankSum

so i have tried ”MappingQualityRankSum < -12.5” “MQRankSum < -12.5” “MappingQualityRankSumTest < -12.5” , none of them work.

From Geraldine_VdAuwera on 2015-03-30

@Gangpao Sorry, the doc is a little out of date. HaplotypeScore should no longer be used, and MappingQualityRankSum should be replaced by MQRankSum. We’ll update the doc asap.

From agopal on 2015-05-17

Hi,

Why is MQ not recommended for indel filtering, the way it is for SNP filtering?

Thanks!

From Geraldine_VdAuwera on 2015-05-19

Most annotations related to mapping quality have been removed from the filtering recommendations for indels because there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.

From Jeegar on 2015-09-03

My variant call set is too small to be calibrated by VQSR and also till now there has been no SNP studies for my organism, Pseudogymnoascus destructans. In that case should I apply Hard filter? If Yes then is it necessary to mention -L 20 for my organism. Also, during the filtering step, how and where should I give commands for the filters?

java -jar GenomeAnalysisTK.jar \ -T VariantFiltration \ -R reference.fa \ -V raw_snps.vcf \ —filterExpression “QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0” \ —filterName “my_snp_filter” \ -o filtered_snps.vcf

In this command how should I specify which filter to use?

From Sheila on 2015-09-08

@Jeegar

Hi,

Yes, for callsets too small for VQSR, you should use hard filtering. You may need to adjust the thresholds and filters to get the best results.

The -L 20 is only for running on chromosome 20, but in your case, I do not think you need to specify it.

-Sheila

From Jeegar on 2015-09-08

@Sheila I have used default thresholds which is given in the hard filtering page of GATK. My VCF file with filtered SNPs contains around 50k SNPs with a PASS. I dont believe all of them are true SNPs because genome size of my organism is 30MB and the organism produces asexually. In that case what selective filteration command should I use to get true SNPs ??

From Sheila on 2015-09-09

@Jeegar

Hi,

Ah, that is something you will have to experiment with. Our recommendations are based loosely on human genomes. It is up to you to decide how many SNPs should pass the filters and how stringently you set your thresholds to get those. A good place to start is by plotting the various annotations and seeing how the distributions look. Based on where many of the variants lie, you can choose your filters.

Good luck!

-Sheila

From aneek on 2015-09-09

Hi,

After hardfiltering of a single sample VCF file, I am getting a filtered VCF file that has my_snp_filter in the FILTER column of some variants. While most of the variants are given PASS and a few as LowQual and LowQual;my_snp_filter. I am not understanding why is it putting the filter name instead of giving the PASS or low quality comments for some variants.

I am attaching here the part of the VCF file in excel format. Please have a look and suggest.

The commandlines I have used here are

java -jar GenomeAnalysisTK.jar \ -T SelectVariants \ -R hs37d5.fa \ -V unfiltered.vcf \ -selectType SNP \ -o unfiltered_snps.vcf

and

java -jar GenomeAnalysisTK.jar \ -T VariantFiltration \ -R hs37d5.fa \ -V unfiltered_snps.vcf \ —filterExpression “QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0” \ —filterName “my_snp_filter” \ -o filtered_snps.vcf

and as same for Indels as given in the tutorial. Then I have used CombineVariants to combine the filtered SNP and Indel vcfs.

HD21A_hardfiltered_FS_annotated_part.xlsx

From Sheila on 2015-09-09

@aneek

Hi,

The LowQual is emitted when the Qual score is under some threshold (30 I think). LowQual is emitted regardless of your filters. The reason some of your variants have my_snp_filter is because they did not pass one or more of your filter expressions. If you want to know exactly which filter it did not pass, you will have to create a filter name for each of your filter expressions.

-Sheila

From aneek on 2015-09-10

@Sheila

Thank you very much for the quick response. So, in that case I guess the commandline should like as follows

java -jar GenomeAnalysisTK.jar \

-T VariantFiltration \

-R hs37d5.fa \

-V unfiltered_snps.vcf \

—filterExpression “QD < 2.0” \

—filterName “QD” \

—filterExpression “FS > 60.0”\

—filterName “FS“

—filterExpression “MQ < 40”\

—filterName “MQ”\

—filterExpression “MQRankSum < -12.5”\

—filterName “MQRankSum”\

—filterExpression “ReadPosRankSum < -8.0”\

—filterName “ReadPosRankSum”\

-o filtered_snps.vcf

Please correct me if I am wrong. Thank you.

From Sheila on 2015-09-10

@aneek

Hi,

That is correct :smile:

-Sheila

From aneek on 2015-09-18

@Sheila

Thank you very much..

From cristianro87 on 2015-10-30

Hello,

I follow the GATK good practices to call variants on 11 samples from a custom truseq design. I do Joint genotyping and apply Hard Filters to my data http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set.

In a second test, i skip the joint genotyping step, and just apply hard filters to my individuals VCFs.

With boot strategies i found the same variant in the same individual wich is a pathogenic variant for a disease i was interested in.

For this variant i have a read depth of 180, and i see A:103 and G:77, G is the reference, there aren’t homopolymers near. But the problem is that this position is ONLY covered by forward reads.

The FS calculated by GATK for this position is FS=0.000

FS is Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

So a value of 0 must be indicative of no bias, but this position is only covered by forward reads, so i undoubtedly had strand bias in this position.

I’m losing something? i can have confidence on this variant?

From Sheila on 2015-11-02

@cristianro87

Hi,

Can you post an IGV screenshot of the region from the original bam file and bamout file? https://www.broadinstitute.org/gatk/guide/article?id=5484

Thanks,

Sheila

From cristianro87 on 2015-11-03

Thanks @Sheila,

now it shows coverage in the reverse strand also.

I see that i’m missing a lot of what gatk really do. I think i must read everything in this site.

![](http://s4.postimg.org/71olh8vul/gatk_test.png)

From Nanda on 2015-11-04

I am currently working on influenza virus and ebola virus. I have 45 influenza samples, so I have 45 bam files aligned with the influenza reference genome.fa.

java -Xmx16g -Djava.io.tmpdir=$out_folder/tmp -jar GenomeAnalysisTK.jar \

-T UnifiedGenotyper \

-nt 12 \

-dcov 10000 \

-glm BOTH \

-R influenza.fa \

-l INFO \

-o A_California_Influenza_Virus.raw.vcf \

—sample_ploidy 1 \

$INPUT_BAM_FILES

I got the raw VCF file (A_California_Influenza_Virus.raw.vcf) for 45 samples in the single VCF. I have 1400 VCF records in the raw VCF file. As per the GATK best practice pipeline research paper, I applied hard filtering option for small datasets.

_Is my VCF records small to go for hard filtering? _

Then I selected snps alone in a separate VCF file.

java -jar /data1/software/gatk/current/GenomeAnalysisTK.jar -T SelectVariants -R A_California_Influenza_Virus_H1N1.fa -V A_California_Influenza_Virus.raw.vcf -selectType SNP -o VariantFiltering/A_California_Influenza_Virus.raw.snps.vcf

Then I applied hard filtering for SNPs.

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R A_California_Influenza_Virus_H1N1.fa -V VariantFiltering/A_California_Influenza_Virus.raw.snps.vcf —filterExpression “QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0” —filterName “myfilter1” -o VariantFiltering/A_California_Influenza_Virus.filtered.snps.vcf

I understand that the variants matching the above conditions are bad variants. **What does QD < 2.0 mean?

What does FS > 60.0 means?

What does MQ < 40.0 ?

What does MQRankSum < -12.5?

What ReadPosRankSum < -8.0?

What is the threshold for high confidence variants QD, FS, MQ, MQRankSum, ReadPosRankSum, DP?**

From Geraldine_VdAuwera on 2015-11-26

@Nanda sorry for the late response, your question slipped off our radar.

We have some explanations of what these annotations mean in the tool docs section of the documentation guide, and @Sheila is starting a new project to improve the documentation with more detail that will help users choose appropriate thresholds.

From Nanda on 2016-01-25

Dear Geraldine,

Is the new project with the documentation for thresholds is ready?

If yes, can you provide me the link please.

Thanks

From Sheila on 2016-01-25

@Nanda

Hi,

Sorry, I have been busy with other items on my to-do list. I will get back to finalizing the document soon. Because you reminded me, I will try to have it published by the beginning of next week :smile:

-Sheila

From mglclinical on 2016-01-26

Hi GATK Team,

Are these hard filtering steps are recommended for only VCF files created by GATK’s haplotype caller ?

Can these hard filtering steps be extended for VCF files created by non-GATK variant callers like Samtools-Mpileup, Freebayes etc ?

Thanks,

mglclinical

From everestial007 on 2016-01-26

Sheila said: >

Nanda

> Hi,

> Sorry, I have been busy with other items on my to-do list. I will get back to finalizing the document soon. Because you reminded me, I will try to have it published by the beginning of next week :smile:

> -Sheila

I have also been awaiting for this document. Thanks that it will be available soon.

From Geraldine_VdAuwera on 2016-01-26

@mglclinical I would say these recommendations are specific to HaplotypeCaller VCFs because they depend on annotations calculated a certain way, and I can’t say whether other callers use equivalent calculations for annotations that have the same name. You could plot the values you get from different callers on many samples and see if they’re equivalent.

From mglclinical on 2016-01-28

Thanks @Geraldine_VdAuwera for the quick reply

From everestial007 on 2016-02-03

@Sheila

Do we have the documentation on thersholds for variant filtering available?

Thanks,

From Sheila on 2016-02-03

@everestial007

Hi,

Sorry for the delay! I have been sick so haven’t been able to get back to it. I will finish my draft by tonight and have it reviewed by the team soon. I know I said it would be out by early this week, but I think it should be out by the end of the week.

-Sheila

From everestial007 on 2016-02-05

@Sheila

That’s fine. Please update me when its done. At the meantime I am trying a different methods to see if I can get a better way of filtering my data. But, I am guess that the documentation from the GATK will be a boon.

Thanks,

From adickey on 2016-02-23

I get an ambiguous error when I try extracting the SNPs and Indels from the raw_variants file. The terminal window appears to be showing the contents of the reference.fa.fai file followed by ##### ERROR -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

I'm limited to how far I can scroll up my terminal window and so am not sure how to find the cause of the error message. Sorry if this is the wrong forum for this type of post.

From Geraldine_VdAuwera on 2016-02-23

@adickey It would be better to open a new discussion. We’ll also need more information about the error — your command line and the full error message. We can’t do anything without that. You may need to increase your terminal buffer, or save the log output to file, in order to get that information.

From Nanda on 2016-03-01

Dear shiela and Geraldine hope you are doing good.

Is the new project with the documentation for thresholds is available now?

From Sheila on 2016-03-03

@Nanda

Hi,

I am sorry for the delay. I have been preparing for a workshop. However, have a look at [this document](https://drive.google.com/a/broadinstitute.org/file/d/0BwTg3aXzGxEDckQyVUNvNE9BYU0/view) that gives a basic overview of the filtering techniques.

-Sheila

From Nanda on 2016-03-07

Thanks Sheila for your wonderful efforts.Sure I will go throughout this document.

Cheers,

Nanda

From Will_Gilks on 2016-03-21

Hi Sheila Geraldine_VdAuwera

I have two questions:

1. The recommended hard-filters between SNPs and indels differ conspicuously. This is most prominent with the tenfold relaxation of Fisher Strand bias threshold. I suspect this is because of the different properties of the two event types. Could you clarify why the filters are different? Apologies if this is stated somewhere already.

2. I removing SNPs close to indel boundaries still recommended – having performed GATK local realignment and HaplotypeCaller – or do these processes mean that SNPs near indels are now called more accurately? If boundary SNPs should be excluded, what distance from the indel edge is recommended? (I don’t expect there to be an established distance but an approximation would be good.)

Best regards,

William Gilks

Variant recommended hard-filters (from above):

SNPs:

QualByDepth (QD) 2.0

FisherStrand (FS) 60.0

RMSMappingQuality (MQ) 40.0

MappingQualityRankSumTest (MQRankSum) 12.5

ReadPosRankSumTest (ReadPosRankSum) 8.0

Indels:

QualByDepth (QD) 2.0

FisherStrand (FS) 200.0

ReadPosRankSumTest (ReadPosRankSum) 20.0

From Geraldine_VdAuwera on 2016-03-23

Hi @Will_Gilks,

That’s right, indels tend to have different “healthy” distributions of values for some annotations compared to SNPs because of their fundamental properties. For example, mapping-related annotations will generally look a bit worse for indels because the alignment scoring algorithms will penalize longer events. An annotation like FisherStrand will also take on higher values because there may be mapping issues or positional issues that affect reads supporting indels.

No, we don’t recommend removing SNPs that are close to indels systematically. The graph assembly step done by HaplotypeCaller improves our ability to make sense of complex regions and reduces the amount of false positives that would arise around indels. This doesn’t mean artifacts no longer occur at all — but it should be infrequent enough to make systematic filtering unnecessary.

From vsvinti on 2016-04-08

Hi there,

Is SOR not recommended in ‘best practices’ for variant hard filtering? It is missing above, but I see it in the Manual filtering (Pretoria 6/2015) powerpoint slides. Please clarify.

Thanks,

Vicky

From Geraldine_VdAuwera on 2016-04-08

Hi @vsvinti,

Sure, here are a few clarifications. First, the tutorials are meant to show examples of how to run the tools, and may not be completely up to date with best practice commands and parameters. Second, technically hard filtering is not Best Practices — it’s a fallback method provided for cases where VQSR cannot be applied. As detailed in [this article](https://www.broadinstitute.org/gatk/guide/article?id=3225), our recommendations for hard filtering should be seen as a starting point from which you can proceed to refine the filtering to suit your project needs. We do recommend using SOR in that context.

From vsvinti on 2016-04-08

thanks @Geraldine_VdAuwera

From benjaminpelissie on 2016-07-21

Hi,

I am in the process of hard filtering my cohort VCF (ie. HaplotypeCaller in GVCF mode followed by GenotypeGVCFs on my 85 g.VCF files) because I a working on a non-model species and do not have access to any known variants database. To determine the parameters I have to use to filter my variants, I plot the annotations as advised in the doc #6925. The plots I obtain do not look like those provided in the documentation, and since I do not know wether I would expect to observe similar patterns, I am a little confused. What I need to know is if my plots still make sense, or if they show a totally wrong pattern, which would mean that something went wrong in my pipelines… Could anybody give me her/his opinion about my plots?

Ben

SNPs_FS.pdf

SNPs_MQ.pdf

SNPs_QD.pdf

SNPs_MQRankSum.pdf

SNPs_ReadPosRankSum.pdf

From Geraldine_VdAuwera on 2016-07-21

Hi @benjaminpelissie,

My first observation is that your plots aren’t nearly as smoothed out as what we produce. Are you dealing with small numbers of variants? That could explain the different look overall.

Next, the ranksum test plots look very reasonable in overall distribution, and so do MQ and FS (for FS it can help to limit axis values to get more usable resolution). Aside from the jagged look, I think they’re fine.

It’s the QD plot that is the most divergent, and I’m a little puzzled by that. Is your organism non-diploid by any chance?

From benjaminpelissie on 2016-07-21

Thank you very much for your quick (and overall optimistic) answer. I am in the process of obtaining the number of variants, so I’ll be able to answer that very soon.

As for your last question, it is supposed to be diploid (it is quite well established), but maybe I can’t rule out the possibility of some weird (non-diploid?) populations in my cohorte, since the sampling is almost continent-scale. Would n>2 organisms typically produce such weird-looking plots? Could there be another explanation, like mapping or genotyping problems? Do you think if I make those plots per sample (not cohorte) I’ll be able to highlight potential weird sample in my cohort?

From Geraldine_VdAuwera on 2016-07-21

The reason why I bring up ploidy is because the distinct shape of the QD density curve comes from the difference in QD profile of heterozygous sites vs. homozygous sites (hom-var sites have twice as much read support for ALT so the QD tends to be ~2x higher than QD of hets). If you had a haploid organism you wouldn’t get such a pattern, and therefore the shape could be quite different — other patterns could emerge that are normally drowned out by the strong het/hom-var pattern — or you could just get a bell-shaped distribution if there is no other underlying pattern. Conversely, if you had a n>2-ploid you might see a more complicated, less obvious pattern related to the mix of genotypes, possibly depending on the state of population equilibrium.

You could definitely try breaking down the cohort to get a curve per sample (easy to do in R/ggplot by tying the fill aesthetic to the sample value). That would give you a first look at how variable the distribution is across samples. That should help decide your next move.

From benjaminpelissie on 2016-07-21

Thanks so much. I’m still amazed by the level of support you guys provide here…

From benjaminpelissie on 2016-07-22

@Geraldine_VdAuwera , I can’t manage to break down my cohort to produce per-sample curves. You advised to do it easily in R/ggplot, but my table do not include sample values, so I can not call it in the fill() argument. I tried to generate another table that would include such an information (about samples) but couldn’t find a way to do that with VariantsToTable (I tried adding -F SM but it didn’t give me exploitable results). I then tried to generate a VCF file for only one sample with SelectVariants (with the options -selectType SNP and -sn my_sample_name), then created the same type of table than the one generated for my cohorte, but QD plots proved to be exactly the same, which is puzzling to me… What am I missing?

Ben

From Sheila on 2016-07-26

@benjaminpelissie

Hi Ben,

The issue is that when you use SelectVariants to select single samples, the annotations from the cohort remain. So, you cannot produce single sample plots from the cohort VCF. One thing you can try is running GenotypeGVCFs on subsets of your 85 sample GVCFs and plot the QD for those subsets. That will help you narrow down which, if any, subsets have the “unsmooth” plot.

-Sheila

From benjaminpelissie on 2016-07-27

Hi Sheila,

Actually that is what I had in mind. Thanks for confirming it (it removes the doubt, which is great)!

Ben

From benjaminpelissie on 2016-08-15

Hi Sheila and Geraldine_VdAuwera.

I then re-genotyped per sample (20 out of 100), and plotted the QD distribution. It turns out the pattern seems to be the same throughout my cohorte (I am attaching the 3 QD plot “variation” that I obtained). So it seems that either I did something wrong with my pipelines which introduced this artefact, or it is a feature of my model’s genome. Do you think this pattern could come from repeated elements? If yes, how could I explore this hypothesis (was thinking about trying to mask repeated elements with repeatmasker)?

Ben

SNPs_11_QD.pdf

SNPs_90_QD.pdf

SNPs_01_QD.pdf

From Geraldine_VdAuwera on 2016-08-17

If the pattern is consistent then it’s more likely that it’s a feature of your organism’s genome — or its population genetics. I’m not sure about how much influence repeated regions would have on this, but it’s worth looking into repeat masking if you think the organism’s genome might have a lot of repeats or paralogous regions. If nothing else it might improve mapping and therefore the variant calls. We can’t really provide you with guidance on this since it’s well outside of scope for us, but I’d love to hear about how it turns out, if you’re willing to share.

From benjaminpelissie on 2016-08-17

Thank you very much for your opinion about it. I will share my findings (if any) on this thread, for sure. Cheers!

From Cathy on 2016-09-05

Hi Sheila and Geraldine_VdAuwera.

I am using VariantFiltration to build a personnal database to use the VQSR process on multispecies mouse sample.

I would like to know if it is rigorous to use a hard filter process, because I can skip some rare allele.

What are your advice.

Thank’s a lot.

Cheers

Cathy

From mscjulia on 2016-09-05

Hi,

To my understanding, this hard filtering is used only when VQSR cannot be applied to the data sets – is that correct please? I’m wondering because as I’m looking at my VQSR (with WGS best practice default setting) result:

chr1 404409 . T G 260.52 PASS AC=7;AF=0.700;AN=10;BaseQRankSum=0.431;ClippingRankSum=0.00;DP=16;ExcessHet=4.7712;FS=0.000;MLEAC=7;MLEAF=0.700;MQ=30.59;MQRankSum=-1.150e+00;QD=17.37;ReadPosRankSum=-4.310e-01;SOR=1.911;VQSLOD=-1.175e+01;culprit=MQ GT:AD:DP:GQ:PL 0/1:2,2:4:47:47,0,49 0/1:1,3:4:21:93,0,21 0/1:1,2:3:25:51,0,25 1/1:0,2:2:6:62,6,0 1/1:0,2:2:6:42,6,0 ./.:1,0:1:.:0,0,0

This position seems to have low read depth for all samples, and I wonder if I should apply hard filtering after VQSR just to exclude cases like this. (I also notice that for this position only, DP is not shown in the INFO column – why is that please?) Thanks a lot.

From Sheila on 2016-09-06

@mscjulia

Hi,

Yes, hard filtering should be used when you do not have the ability to use VQSR. You do not need to apply hard filters after VQSR. VQSR takes into account many annotations, so even if some annotation values do not look great, the other annotation values must have compensated for those not so good values.

The DP is reported in the INFO column (DP=16). In any case, we do not recommend filtering on depth, so your results are not affected.

-Sheila

From mscjulia on 2016-09-06

@Sheila Thank you very much for your reply. I’m so sorry I didn’t see the DP value in INFO column. Just a follow up question, we care about depth in this particular project for some experimental reasons, so if I would like to only look positions with good coverage (>10), any GATK tools can help me do some basic filtering before variant calling please? Thank you!

From Sheila on 2016-09-06

@Cathy

Hi,

Our recommendations for hard filtering are designed to be quite lenient. However, it is really up to you to decide how stringent you want to be. Keep in mind, the annotations will not penalize alleles that are found in only a few samples if there is good evidence for the variant.

Have a look at [this article](https://software.broadinstitute.org/gatk/documentation/article?id=6925) for ways to improve our recommendations to better suit your dataset.

-Sheila

From Sheila on 2016-09-07

@mscjulia

Hi,

You can use VariantFiltration. Although we do not recommend filtering on depth, it is better to wait until after variant calling to do filtering.

-Sheila

From ngs_medicine on 2016-10-06

@Sheila

I have another question regarding to the hard filtering!

After I have done the hard filtering I will have two filtered vcf files (one for snps, one for indels)

Then, I have to combine these two vcf files using CombineVariant function.

Then, our output file is the final file for annotation. Is it right? Using this combined vcf file, I can annotate with annova or snpSift.. etc.

IS it right??

From Sheila on 2016-10-06

@ngs_medicine

Hi,

That is correct :smile:

-Sheila

From sysuls on 2016-10-18

Hi,

It’s my first time to use gatk to call variance. I just followed the steps in the Best Practice. But I have a question using VariantFiltration that the values distribution of QD, FS, ReadPosRankSum are in different patterns from them in your documents. I get confused that what threshold should I choose. Could anybody give me her/his opinion about my plots?

(PS:each line represent a population in the plots.)

Saib

FS.pdf

QD.pdf

ReadPosRankSum.pdf

From Sheila on 2016-10-19

@sysuls

Hi Saib,

Okay. There can be many different explanations for why your plots deviate from ours. What type of samples are you working with? Human samples? Whole exomes or whole genomes? How many samples? What do you mean by “you followed Best Practices”? Can you tell us the exact tools you ran?

Thanks,

Sheila

From sysuls on 2016-10-20

@Sheila

Hi Sheila, thank you very much for your reply. I’m sorry that I don’t make myself clear.

I used Genome Analysis Toolkit (GATK v.3.30) for base quality recalibrations, insertion/deletion (indel) realignment, SNPs and indels discovery and genotyping across all 33 simples simultaneously accorded to GATK best practice recommendations. Meanwhile, SNPs and short indels against the reference were also called using samtools/bcftools v.0.1.19.

After selecting the overlap of SNPs called by UnifiedGenotyper in GATK and mpileup in samtools using SelectVariant. I’m confused what threshold should I choose to do the hardfilter because the values distribution of QD, FS, ReadPosRankSum are in different patterns from them in your documents. Each line in my plot represents a population

Thanks

Saib

From Will_Gilks on 2016-10-20

Hi @sysuls

Is that ggplot you’re using ? To obtain a better view of your data, firstly, format the GATK variants info table as ‘long’ using dplyr with the tidyr-gather function :

long.df = df > gather (metric, value, -c(CHROM, POS, ID))

Then either plot the value for each variant against genomic position using:

ggplot(long.df, aes(POS, value) + geom_point() + facet_grid(metric~chrom, scales = “free”)

This allows you to see where in the genome extreme values are, though may crash your computer depending on how many variants you have.

Or plot the distributions using:

ggplot(long.df, aes(POS, value) + geom_histogram() + facet_wrap(“metric”, scales = “free”)

A good trick is before reformatting the data frame is to assign groups for ‘allele frequency’, e.g. 0-0.1, 0.1-0.4, 0.4-0.5, and then colour the histogram bars accordingly. Also assigning a variant type is good, i.e. Bi-allelic or multi-allelic, SNP or indel = 4 groups – because different types of variants can have different properties in terms of sequencing and genotyping.

An extended example is hopefully available here https://zenodo.org/record/159282#.WAjGvNy8J7E

From Sheila on 2016-10-23

@sysuls

Hi Saib,

Are you working with human samples? What do you mean by “Each line in my plot represents a population”? Can you try plotting the variants from UnifiedGenotyper only, and not the intersection of two callers?

-Sheila

P.S. We highly recommend using HaplotypeCaller over UnifiedGenotyper.

From bbimber on 2017-03-16

Hello,

Is there an elegant way to simultaneously apply different filters to both SNPs and INDELs? For example, using these filters:

https://software.broadinstitute.org/gatk/guide/article?id=3225

Does anyone attempt more complicated JEXL, like:

(!vc.isIndel() && ReadPosRankSum < -8.0) || (vc.isIndel() && ReadPosRankSum < -20.0)

I didnt check the validity of my JEXL, but I assume something like this could work. The general idea is to skip subsetting the VCF into separate SNP/Indel files, but still apply different filters. Is there a reason this wouldnt work?

Thanks,

Ben

From Geraldine_VdAuwera on 2017-03-20

Hi @bbimber, what you propose should work. Assuming it does, I’m not sure why we don’t propose that rather than splitting out SNPs and Indels, to be honest. I’ll check with the team but it might just be an oversight on our part, since we normally don’t use hard filters ourselves.

From bbimber on 2017-03-20

OK. One would need to think carefully about the test used to separate indel/non-indel (this thread above talks about how to deal w/ sites with both indels and SNPs). The JEXL also would not be terribly friendly. Outside of this, it does seem like it should work.

From bbimber on 2017-04-04

One related question to this: I realize hard filtering isnt GATK’s preferred method to support; however, in your recommendations of filtering SNPs vs. Indels (https://software.broadinstitute.org/gatk/guide/article?id=3225), how do MNPs and MIXED sites fit into this?

I think I saw an example once in which you guys ran SelectVariants twice with selectType=SNP and selectType=Indel, which would mean MNPs and MIXED got dropped, right?

From Sheila on 2017-04-10

@bbimber

Hi,

metheuse asked this in October 2014 in the same thread (but you have to go to the first page). We don't have any explicit recommendations. Have a look at what Geraldine said. Perhaps metheuse can share his results.

-Sheila

From Juls on 2017-04-21

Sheila Geraldine_VdAuwera

I am using GATK 3.7 and I applied the hard filter as above but I get these warnings below – could you have a look?

Thank you!!

Best,

Julia

INFO 06:40:12,569 HelpFormatter – —————————————————————————————————————————

INFO 06:40:12,572 HelpFormatter – The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18

INFO 06:40:12,573 HelpFormatter – For support and documentation go to https://software.broadinstitute.org/gatk

INFO 06:40:12,573 HelpFormatter – [Thu Apr 13 06:40:12 CEST 2017] Executing on Linux XXX

INFO 06:40:12,573 HelpFormatter – Java HotSpot™ 64-Bit Server VM 1.8.0_121-b13

INFO 06:40:12,578 HelpFormatter – Program Args: -T VariantFiltration -R reference.fa -V justsnps.vcf.gz —filterExpression QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 —filterName top_snp_filter -o justsnps_filtered_keep.vcf

INFO 06:40:12,583 HelpFormatter – Executing as XXXX

INFO 06:40:12,583 HelpFormatter – Date/Time: 2017/04/13 06:40:12

INFO 06:40:12,584 HelpFormatter – —————————————————————————————————————————

INFO 06:40:12,626 GenomeAnalysisEngine – Strictness is SILENT

INFO 06:40:13,437 GenomeAnalysisEngine – Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000

WARN 06:40:13,847 IndexDictionaryUtils – Track variant doesn’t have a sequence dictionary built in, skipping dictionary validation

INFO 06:40:14,608 GenomeAnalysisEngine – Preparing for traversal

INFO 06:40:14,618 GenomeAnalysisEngine – Done preparing for traversal

INFO 06:40:14,619 ProgressMeter – [INITIALIZATION COMPLETE; STARTING PROCESSING]

WARN 06:40:15,026 Interpreter – ![59,73]: ‘QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0;’ undefined variable ReadPosRankSum

WARN 06:40:15,052 Interpreter – ![38,47]: ‘QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0;’ undefined variable MQRankSum

WARN 06:40:15,140 Interpreter – ![59,73]: ‘QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0;’ undefined variable ReadPosRankSum

WARN 06:40:15,168 Interpreter – ![38,47]: ‘QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0;’ undefined variable MQRankSum

WARN 06:40:15,170 Interpreter – ![38,47]: ‘QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0;’ undefined variable MQRankSum

From Sheila on 2017-04-25

@Juls

Hi Julia,

Those are nothing to worry about. They are just telling you that some annotations could not be calculated at some sites. You can look at the annotation documentation to see why they may not be calculated. Also, have a look at [this thread](http://gatkforums.broadinstitute.org/gatk/discussion/4702/undefined-variable-mappingqualityranksum).

-Sheila

From Juls on 2017-04-26

Thank you!

From bbimber on 2017-04-26

we also see those warnings. if you want to eliminate them, one way is to test whether the VariantContext has that attribute:

(vc.hasAttribute(‘ReadPosRankSum’) && ReadPosRankSum < -8.0)

we use this on attributes that are expected to be missing some certain records. before adding this, i would probably double check that is it expected in your case.

From uyak on 2017-06-08

Hi,

I wonder if the fixed hard filter parameters recommendations listed on this page still hold for the version 3.7?

Thanks,

Uyak

From Sheila on 2017-06-12

@uyak

Hi Uyak,

They do still hold for human data. You may also be interested in this [article](https://software.broadinstitute.org/gatk/documentation/article?id=6925).

-Sheila

From Ahmadsam66 on 2017-06-21

I have ran SNP filtration with above parameters on my data but I have got ~350K snp on WES. How can decrease it ?

From Sheila on 2017-06-26

@Ahmadsam66

Hi,

That is a lot of SNPs for exome data. How did you generate the final VCF? Did you follow Best Practices?

Thanks,

Sheila

From mjtiv on 2017-08-25

Dear GATK,

I understand that after Hard Filtering you need to analyze the quality control plots as seen in this post: https://software.broadinstitute.org/gatk/documentation/article.php?id=6925 to verify the data was filtered correctly. However, I am slightly confused at how these plots were created in the first place, like how SNP density values were determined etc. Any insights into how to generate these plots from an initial VCF file would be greatly appreciated. Based on prior posts I think I might be missing something obvious.

Thank You

Joe

From Sheila on 2017-08-29

@mjtiv

Hi Joe,

Indeed, I did not put the actual commands in that article. If you look in the presentations section, the filtering tutorial has some example commands that will help.

-Sheila

EDIT: The latest tutorial is a bit more advanced, but if you look at the [2016 version](https://drive.google.com/drive/folders/0BwTg3aXzGxEDczFPaGt0eTVyMjA), some of the more basic commands are listed.

From mjtiv on 2017-08-31

Sheila thank you for posting that link! Both pdf files helped out a lot!

From Wes3985 on 2017-10-05

Hi, I am looking to filter variants using hard filters. I can run the following and it works fine:

`GenomeAnalysisTK \ -R $path_to_ref \ -T VariantFiltration \ -o $outpath/$sample$outExt \ —variant $vcf_file \ —clusterSize 3 \ —clusterWindowSize 35`

it produces an output vcf with the expected annotation indicating that a snp is to close to another snp. This also suggests that there is no problem with general running of GATK VariantFiltration. However when I run the following:

`GenomeAnalysisTK \ -R $path_to_ref \ -T VariantFiltration \ -V $vcf_file \ —filterExpression “QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0” \ —filterName “GATK_HARD_snp_filter” \ -o $outpath/$sample$outExt`

The program crashes with an error saying that I have not supplied a correct argument, I copied and pasted these line directly from the sections above which is why I am baffled that this error is produced, I have checked for typos and the obvious and nothing seems out of place. I am using gatk/3.4.46 and java/1.8.0_45. Is this a known bug in this version?

Also I would ideally like to run these 2 filters in 1 program, however I am unsure if the window filter would run before or after the hard filtering criteria? Or if it depends on the order I put the arguments in? Hard filter, then window filter is preferable.

Many thanks

From Sheila on 2017-10-09

@Wes3985

Hi,

Can you try the command that crashed with the latest version of GATK?

When you are using two filters, both filters will be applied to your VCF sites (probably in order of input), regardless of order. For example, if a site is filtered both by the clusterSize and “GATK_HARD_snp_filter” the FT field will reflect both filter names.

I hope that helps.

-Sheila

From Wes3985 on 2017-10-12

Thank you for responding Sheila, our cluster administrators have upgraded to gatk/3.8.0 and I have tried to run the scripts again. Unfortunately the same problem is still occurring? I have also tried things like changing the order of the arguments and changing the arguments names to shorthand and longhand versions, though (as expected) this made no difference. I also tried shortening the filter command to simple commands with only 1 filter option but the same error is still thrown. I’ve also tried changing the quotes around the filter statement and swapping double pipes for single ones but nothing seems to work. The errors are consistently statements like: ‘Invalid argument value ‘<’ at position 10’. I’m not sure if that helps?

From Sheila on 2017-10-16

@Wes3985

Hi,

Okay. Can you try running each filter separately? I am wondering if it is an issue with the ||?

Thanks,

Sheila

From ksi216 on 2017-10-18

Hi,

I keep getting the error —filterExpression: command not found when running the variant filtration. I have made a new script. Still unable to see my error. any suggestions ? I literally have it copied exactly as posted except for the reference fasta and the input vcf

From Sheila on 2017-10-19

@ksi216

Hi,

Weird. Can you try removing the `\` from the command?

Also, can you try typing it instead of copying and pasting it?

Thanks,

Sheila

From NH2A on 2018-02-12

Hi,

I’m working on exomes : should I filter my variants on SOR rather than on FS ?

Thanks in advance

From Sheila on 2018-02-14

@NH2A

Hi,

We recommend using both. Have a look at [this article](https://software.broadinstitute.org/gatk/documentation/article?id=11069). I think SOR helps in low coverage regions (where FS does not do the best job).

-Sheila

From RB_74 on 2018-03-20

Hi have question in the same line…

For filter snp and indels and i have used following filter

`—filterExpression ‘QD < 2.0 || FS > 60.0 || MQ < 20.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0’ \ —filterName “basic_snp_filter” \

I am getting error like @Wes3985.

Second question is

If snps are present only in one replicate i was wondering how could they pass the filtering criteria? Is there is any way to PASS filter snps which are present in majority of the replicates?

From shlee on 2018-03-20

Hi @RB_74,

Are you also copy-pasting the commands as @Wes3985 had done? I suggest typing out the commands, especially the portions with `-` dashes. Sometimes, depending on the source of the copy, these get changed to hyphens or other similar looking but not the same encoding.

> If snps are present only in one replicate i was wondering how could they pass the filtering criteria?

You are asking if a SNP variant site in one sample is sufficient to pass filtering criteria. Yes, you can have a high-quality single-sample variant callset based on high-quality high depth of coverage reads data.

> Is there is any way to PASS filter snps which are present in majority of the replicates?

If you’d like to filter for variant sites that are present in at least two or more samples, then you can use CombineVariants towards this. See [the command in section 1.2](https://software.broadinstitute.org/gatk/documentation/article?id=9183#1). The tool does not, however, differentiate types of variants (SNPs vs. indels) nor ALT allele. It only cares that the site is variant in `-minN 2` at least two samples in the callset. You can change the `-minN` parameter to any number.

From melis on 2018-05-23

Hello,

Thanks for the hard filtering recommendations, very helpful! Do you also have recommended thresholds for non-variant sites? Do you think we can simply use the same ones as in the SNPs? Thanks!

Best,

Melis

From Sheila on 2018-05-30

@Melis

Hi Melis,

Have a look at [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/comment/44981).

-Sheila

From changyexiao on 2018-06-12

Hi~ I have a question. Why filter all the snp with GT 1/1? These sites with GT 1/1 have no MQRankSum and ReadPosRankSum, but they are filtered through these hard filters. Are these sites not credible？ Or should we filter manually？Thank you.

From shlee on 2018-06-13

Could you please provide more context for your question @changyexiao?

From nute11a on 2018-06-29

Hello,

I was wondering if there was a way to filter out heterozygous SNP calls from a VCF file. I am working with a model that has a doubled haploid genome, so all calls should be homozygous. Any heterozygotes are a false positive and can be removed.

Thank you!

From shlee on 2018-07-02

Hi @nute11a,

Please check out the [VariantFiltration](https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_filters_VariantFiltration.php) tool doc and the [JEXL article](https://software.broadinstitute.org/gatk/documentation/article.php?id=1255). I believe with GATK4, we’ve made it easier to filter on genotypes using VariantFiltration alone (without having to go to using JEXL). Let us know how it goes.

From nute11a on 2018-07-02

Thank you @shlee, I actually used a grep command to filter. I appreciate your help!

From nute11a on 2018-07-03

@shlee : Sadly, the grep command I used didn’t work. I have not had any luck with VariantFiltration yet. I tried:

-T VariantFiltration \ -R referenceFASTA \ -o outputVCF \ —variant inputVCF \ —genotypeFilterExpression “isHet == 1” \ —genotypeFilterName “GT”

^^ I noticed that this command put a PASS in homozygote genotypes and GT in heterozygote genotypes, which exactly what I wanted.

I followed VariantFiltration with: -T SelectVariants \ -R referenceFASTA \ -V inputVCF \ —excludeFiltered \ -o outputVCF \

BUT:

0/1:4,6:10:GT:99:0|1:19793178_C_A:240,0,171

0/0:6,0:6:PASS:0:.:.:0,0,163

For some reason, the heterozygote genotype was still there with the GT. I don’t think SelectVariants extracted the passing calls. Do you have any suggestions on what to try next? Thank you!!

From nute11a on 2018-07-03

@shlee : My apologies, I did this next and it worked:

-T SelectVariants \ -R referenceFASTA \ —variant inputVCF \ —setFilteredGtToNocall

Thank you for your patience! :-)

From shlee on 2018-07-03

Hi @nute11a,

I’m glad you got things working. I just wanted to sum things up for others who may land on this thread.

Let’s say `trio.vcf` has a record like so:

```

GT:AD:DP:GQ:PL

0/1:17,15:32:99:399,0,439 0/1:11,12:23:99:291,0,292 1/1:0,30:30:90:948,90,0

```

Here I’m updating your commands to use GATK4 instead of GATK3.

```

gatk VariantFiltration \

-V trio.vcf \

-O trio_VF.vcf \

—genotype-filter-expression “isHet == 1” \

—genotype-filter-name “isHetFilter“

```

Note from the [VariantFiltration tool document](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_filters_VariantFiltration.php#—genotype-filter-expression) other options to filter on the FORMAT (aka genotype call) field:

> We have put in convenience methods so that one can now filter out hets (“isHet 1"), refs ("isHomRef 1”), or homs (“isHomVar == 1”). Also available are expressions isCalled, isNoCall, isMixed, and isAvailable, in accordance with the methods of the Genotype object.

In the resulting `trio_VF.vcf`, our example record adds an `FT` field and becomes:

```

GT:AD:DP:FT:GQ:PL

0/1:17,15:32:isHetFilter:99:399,0,439 0/1:11,12:23:isHetFilter:99:291,0,292 1/1:0,30:30:PASS:90:948,90,0

```

We see that HET (`0/1`) genotype calls get a `isHetFilter` in the `FT` field and other genotype calls get a `PASS` in the genotype field.

Then running:

```

gatk SelectVariants \

-V trio_VF.vcf \

—set-filtered-gt-to-nocall \

-O trioGGVCF_VF_SV.vcf

```

changes the `GT` genotypes of the `isHetFiltered` genotype records to null or no call (`./.`) like so:

```

GT:AD:DP:FT:GQ:PL

./.:17,15:32:isHetFilter:99:399,0,439 ./.:11,12:23:isHetFilter:99:291,0,292 1/1:0,30:30:PASS:90:948,90,0

```

From shlee on 2018-07-05

For anyone following this thread, the above is now an official (How to) tutorial at https://gatkforums.broadinstitute.org/gatk/discussion/12350/. Please check the tutorial for updates.

From CNBers on 2018-09-11

> @Sheila said:

> Hi Wendy,

> Yes, you can use CombineVariants for this step.

> -Sheila

@Sheila

I want to ask how many samples are enough to do VQSR?

How do I plot the distribution of the parameters of hard-filtering ?

Best Ningbo

From Donne96 on 2018-10-31

Sheila shlee

To determine the thresholds of the hard-filters suited for my data I want to recreate some graphs as can be found in doc #6295. I’ve been reading to a number of forums now and I can see there are many people who have recreated such files for their data. However, since I’m relatively new to R I was wondering whether there is some sort of R code available that would help to give me a kickstart.

Thanks in advance,

Donné

From shlee on 2018-10-31

Hi @Donne96,

@KateN from our team has developed an R-script towards plotting annotations for the GATK hands-on filtering workshop tutorial. You can find this script in workshop materials prior to September (1809). A link to workshop materials is on the [Presentations page](https://software.broadinstitute.org/gatk/documentation/presentations).

From RSmithAg on 2018-11-10

Hi,

I was wondering what is the rational behind this threshold: QualByDepth (QD) 2.0.

How confident should I be of a QD between 2.8 and 5.52 in non-human samples? The rest of the variants called are all over 10.

Thanks in advance.

Report abuse