created by Geraldine_VdAuwera
on 2012-08-01
JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.
In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.
JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:
"QUAL > 30.0"
QUAL
is a key: the name of the annotation we want to look at30.0
is a value: the threshold that we want to use to evaluate variant quality against>
is an operator: it determines which "side" of the threshold we want to selectThe complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:
"MY_STRING_KEY == 'foo'"
You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:
"QUAL / DP < 10.0"
You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):
"QUAL > 30.0 && DP == 10"
where &&
is the logical "AND".
In the case where you want to select variants that have at least one of several conditions fulfilled, provide each expression separately:
"QD < 2.0" \ "ReadPosRankSum < -20.0" \ "FS > 200.0"
To be on the safe, do not use compound expressions with the logical "OR" ||
as a missing annotation will negate the entire expression.
You can also filter individual samples/genotypes in a VCF based on information from the FORMAT field. Variant Filtration will add the sample-level FT tag to the FORMAT field of filtered samples. Note however that this does not affect the record's FILTER tag. This is still a work in progress and isn't quite as flexible and powerful yet as we'd like it to be. For now, you can filter based on most fields as normal (e.g. GQ < 5.0), but the GT (genotype) field is an exception. We have put in convenience methods to enable filtering out heterozygous calls (isHet == 1), homozygous-reference calls (isHomRef == 1), and homozygous-variant calls (isHomVar == 1).
Sensitivity to case and type
You're probably used to case being important (whether letters are lowercase or UPPERCASE) but now you need to also pay attention to the type of value that is involved -- for example, numbers are differentiated between integers and floats (essentially, non-integers). These points are especially important to keep in mind:
QUAL
field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual
, qual
or whatever) in your JEXL expression.NumberFormatException
for numerical type mismatches).Complex queries
We highly recommend that complex expressions involving multiple AND/OR operations be split up into separate expressions whenever possible to avoid confusion. If you are using complex expressions, make sure to test them on a panel of different sites with several combinations of yes/no criteria.
Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.
Introducing the VariantContext object
When you use SelectVariants with JEXL, what happens under the hood is that the program accesses something called the VariantContext, which is a representation of the variant call with all its annotation information. The VariantContext is technically not part of GATK; it's part of the variant
library included within the Picard tools source code, which GATK uses for convenience.
The reason we're telling you about this is that you can actually make more complex queries than what the GATK offers convenience functions for, provided you're willing to do a little digging into the VariantContext methods. This will allow you to leverage the full range of capabilities of the underlying objects from the command line.
In a nutshell, the VariantContext is available through the vc
variable, and you just need to add method calls to that variable in your command line. The bets way to find out what methods are available is to read the VariantContext documentation on the Picard tools source code repository (on SourceForge), but we list a few examples below to whet your appetite.
Using VariantContext directly
For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:
java -jar GenomeAnalysisTK.jar -T SelectVariants \ -R reference.fasta \ -V variants.vcf \ -select 'vc.getGenotype("NA12878").isHomRef()'
Groovy, right? Now here's a more sophisticated example of JEXL expression that finds all novel variants in the total set with allele frequency > 0.25 but not 1, is not filtered, and is non-reference in 01-0263 sample:
! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP() -o 01-0263.high_freq_novels.vcf -sn 01-0263
Using the VariantContext to evaluate boolean values
The classic way of evaluating a boolean goes like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants \ -R reference.fasta \ -V my.vcf \ -select 'DB'
But you can also use the VariantContext object like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants \ -R reference.fasta \ -V my.vcf \ -select 'vc.hasAttribute("DB")'
Using VariantContext to access annotations in multiallelic sites
The order of alleles in the VariantContext object is not guaranteed to be the same as in the VCF output, so accessing the AF by an index derived from a scrambled alleles array is dangerous. However! If we have the sample genotypes, there's a workaround:
java -jar GenomeAnalysisTK.jar -T SelectVariants \ -R reference.fasta \ -V multiallelics.vcf \ -select 'vc.hasGenotypes() && vc.getCalledChrCount(vc.getAltAlleleWithHighestAlleleCount())/(1.0*vc.getCalledChrCount()) > 0.1' -o multiHighAC.vcf
The odd 1.0 is there because otherwise we're dividing two integers, which will always yield 0. The vc.hasGenotypes()
is extra error checking. This might be slow for large files, but we could use something like this if performance is a concern:
java -jar GenomeAnalysisTK.jar -T SelectVariants \ -R reference.fasta \ -V multiallelics.vcf \ -select 'vc.isBiallelic() ? AF > 0.1 : vc.hasGenotypes() && vc.getCalledChrCount(vc.getAltAlleleWithHighestAlleleCount())/(1.0*vc.getCalledChrCount()) > 0.1' -o multiHighAC.vcf
Where hopefully the ternary expression shortcuts the extra vc
calls for all the biallelics.
Using JEXL to evaluate arrays
Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:
java -jar GenomeAnalysisTK.jar -T SelectVariants \ -R reference.fasta \ -V variants.vcf \ -select 'vc.getGenotype("NA12878").getAD().0 > 10'
If you would like to select sites where the alternate allele frequency is greater than 50%, you can use the following expression:
java -jar GenomeAnalysisTK.jar -T SelectVariants \ -R reference.fasta \ -V variants.vcf \ -select vc.getGenotype("NA12878").getAD().1 / vc.getGenotype("NA12878").getDP() > 0.50
Updated on 2018-12-14
From Geraldine_VdAuwera on 2014-09-01
Questions up to August 2014 have been moved to an archival thread:
http://gatkforums.broadinstitute.org/discussion/4554/questions-about-jexl-expressions-for-selecting-variants-according-to-specific-reuqirements
From shangqianxie on 2014-10-20
Hi Geraldine_VdAuwera, Thanks for the detail about JEXL expression .I used SelectVariants -select ‘vc.getGenotype(“SampleN”).isHet’ or .isHomVar for selecting the known genotype like “0/1 or 1/1”. But in my VCF file, there are some missing genotype like “.” , and I also want to select those unknow genotype. Can you give me some comments for this. Thanks.
Best,
Xie
From Sheila on 2014-10-20
@shangqianxie
Hi Xie,
To select the unknown genotypes (./.) you can select for anything that is not homref, het, or homvar.
-Sheila
From shangqianxie on 2014-10-21
Hi Sheila, thank you for you timely reply. I select by vc.getGenotype(‘sample08’).isnotHomVar() …, but there is error: “Invalid JEXL expression detected for select-0 with message ![72,85]: vc.getGenotype(‘sample08’).isnotHomVar() && vc.getGenotype(‘sample08’).isnotHet();’ unknown, ambiguous or inaccessible method isnotHomVar”
I think the problem is incorrect JEXL expression. So are there some JEXL expressions for select setting ?
Thanks,
Xie
From Sheila on 2014-10-21
@shangqianxie
Hi Xie,
Instead of using vc.getGenotype(‘sample08’).isnotHomVar(), try using
! vc.getGenotype(‘sample08’).isHomVar(). The ! means not.
-Sheila
From shangqianxie on 2014-10-23
Hi Sheila,
Thanks again for your reply, It makes sense.
Xie
From jullfly on 2015-01-14
Hi,
I want to be able to select the sites that have a DP >= 5 for all my samples. I have 24 samples, all of them “C??” (i.e. C01, C04, C12, etc). I tried using the * wildcard character like this: ‘vc.getGenotype(“C*”).getDP() >= 5’ , but this gave me an error. Is there a way to do this, or will I have to write the command individually for each sample (separated by &&)? Thanks for the help.
From Sheila on 2015-01-15
@jullfly
Hi,
Unfortunately, you cannot iterate over sample names using the variant context operations. You will have to make a script to run on each sample name.
-Sheila
From jullfly on 2015-01-15
Thanks for your response @Sheila. It’s ok, I can script it for all sample names but it’s just not as neat and elegant of a code as it would be if there was a way to iterate through. Oh well, it will do what I need it to do nevertheless!
From jullfly on 2015-01-19
Hi, sorry but I have another question. I keep getting errors with this filtering expression, and I can’t figure out what I am doing wrong.
(Note: this is a snippet of my script, $java, $GATK, and $ref variables are defined elsewhere in the script).
$java -Xmx2g -jar $GATK \ -T SelectVariants \ -R $ref \ —restrictAllelesTo BIALLELIC \ —variant Variants.SNPs.filtered1.vcf.C01.5.vcf \ —select_expressions ‘vc.getGenotype(“C01”).isHet()’ \ —select_expressions ‘((double)(vc.getGenotype(“C01”).getAD().0) / (vc.getGenotype(“C01”).getAD().1 + vc.getGenotype(“C01”).getAD().0)) > 0.25’ \ —select_expressions ‘((double)(vc.getGenotype(“C01”).getAD().0) / (vc.getGenotype(“C01”).getAD().1 + vc.getGenotype(“C01”).getAD().0)) < 0.75’ \ -o Variants.Het.SNPs.C01.5.vcf
This is the error message:
java.lang.IllegalArgumentException: Argument select-2has a bad value. Invalid expression used.
I double-checked my casting rules in Java for casting the int of AD into a double, and it seems fine to me. I even tested an analogous situation with simple numbers and it works.
boolean x = ((double) (1) / (2+3)) < 1; //x is true
Help would be greatly appreciated!
From Geraldine_VdAuwera on 2015-01-19
Hmm, I’m not sure JEXL allows you to do type casting — I’ve never seen it used in a JEXL expression. To be honest I would probably just dump the data using VariantsToTable and do the selection in R, which is more flexible for complex queries.
From jullfly on 2015-01-21
Thanks @Geraldine_VdAuwera, I will keep that option in mind as I continue filtering my variants. But actually I was able to get the above script to work by converting it to a float instead of a double. Others may find this useful:
vc.getGenotype(“C01”).getAD().0.floatValue()
From Geraldine_VdAuwera on 2015-01-21
Oh nice, glad to hear you got it to work. And thanks for reporting your solution for other users’ benefit.
From mmterpstra on 2015-01-30
Because this has moved a few times since my last post.
For people who can read (a little) java, here is the link to the implementation:
https://github.com/samtools/htsjdk/blob/c63464d69dea17b684257b7d981a302126871355/src/java/htsjdk/variant/variantcontext/VariantContext.java#
I’m not sure this link works, so alternatively go to:
https://github.com/samtools/htsjdk/ and find it below /src/java/htsjdk/variant/variantcontext/VariantContext.java
It has all the cool stuffs like vc.isPolymorphicInSamples() vc.isIndel()
From newbie16 on 2015-09-23
Hi
I have a vcf which in FORMAT field has a tag HQ which is an array of integers (e.g. 40,200)
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sampleIdGoesHere
1 815233 . TC CA 540 PASS AN=2;AC=1;RPM=L1M3|L1|31.4;SEGDUP=8 GT:VT:FT:HFT:GQ:HQ:PL:DP:AD 0/1:ref,delins:PASS:PASS,PASS:341:341,488:540,0,364:249:46,200
1 815521 . T . 358 PASS AN=2;END=815522 GT:VT:FT:HFT:GQ:HQ:DP:AD:PS 0|0:ref,ref:PASS:PASS,PASS:358:358,410:91:91:815520
I want to filter records such that either of the HQ values are >200
Is there a way to do that ?
From Sheila on 2015-09-24
@newbie16
Hi,
I think using JEXL should work. Have a look at this document: http://gatkforums.broadinstitute.org/discussion/1255/variant-filtering-methods-involving-jexl-queries The last section “Example using JEXL to evaluate arrays” should help.
-Sheila
From bbimber on 2015-10-29
Hello,
I’d like to perform site-level select or filtering on a VCF, using a criteria based on genotype attributes. Specifically, I’d like do things like select/filter sites where at least one genotype has depth > X, or where all genotypes have depth < Y. There is a post above about evaluating arrays, which I think is analogous to what I’m after. However, that example seems to be subsetting a specific genotype: ‘vc.getGenotype(“NA12878”).getAD().0 > 10’, as opposed to doing some sort of loop or aggregate (for example, max(DP) could accomplish what I need). Is this sort of thing possible? Thanks in advance.
-Ben
From Sheila on 2015-11-02
@bbimber
Hi Ben,
Unfortunately, you will need to to write your own script to do what you want. The evaluating arrays section refers only to evaluating annotations that are arrays of values (like AD).
-Sheila
From Geraldine_VdAuwera on 2015-11-04
@bbimber As a workaround I would recommend looking into VariantsToTable to output a table of the properties you’re interested in, set up your evaluation logic in an Rscript (or a civilized scripting language) and use that to output a list of positions that you want to filter or keep, then use that to SelectVariants the subset that you want. Bit roundabout but less risky than scripting something to parse the VCF directly, in my opinion.
From bbimber on 2016-03-14
That last comment about AD actually brings up another question. I am hoping to apply genotype filters based on AD. I was trying something like:
-G_filter “isHet && g && g.hasAd() && g.getAD().1 < 5”
I would have assumed the Genotype was in scope (like vc); however, after reviewing the code, I dont think that’s true. Could I confirm that?
Second, I think AD gets omitted. From what I can see, when the Genotype is put into the JEXLMap, a handful of fields are explicitly added, and then it defers to getExtendedAttributes(). However, if you look at FastGenotype, AD wont be part of getExtendedAttributes():
https://github.com/samtools/htsjdk/blob/bd92747fd3672635de96473fea6d4b38e8635c8e/src/java/htsjdk/variant/variantcontext/JEXLMap.java
Is this an oversight?
From Geraldine_VdAuwera on 2016-03-16
The syntax for getting at AD values (e.g. `-select ‘vc.getGenotype(“NA12878”).getAD().0 > 10’`) is given in the document above, see the last point in the doc.
From bbimber on 2016-03-16
sorry if the post wasnt clear, but i’m asking about genotype filters, not site-level filters. As noted above, from what I see in the code, the JEXLMap is populated through different codepaths.
doing a genotype-level filter on a hard-coded samplename wouldnt make sense, for example. i’m pretty sure ‘vc’ isnt put in scope for the genotype filters either.
From Geraldine_VdAuwera on 2016-03-16
Ah, I’m afraid we don’t currently provide support for using advanced JEXL in genotype filters, sorry. It’s just not something we use internally, so we’ve never explored this. What’s currently on this page is all we can offer to help with at this time. If you figure out a way to do what you want, please share it so that others may benefit.
From bbimber on 2016-03-16
yeah, that’s understandable. i wonder if you can answer this: the relevant code here is under HTS-JDK. It wouldnt be hard to submit a pull request to improve the situation (maybe genotype filters use a JEXLMap more like site filters). is this worth my time? will gatk4 use a similar or a completely different codepath?
From Geraldine_VdAuwera on 2016-03-16
I can’t judge whether it would be worth your time because I don’t know how much this capability is worth to you, but I can reassure you that GATK4 will continue to use htsjdk and use the same codepaths to enable JEXL queries for the foreseeable future. And if you do submit this with an example of how it can be leveraged from GATK, we will support your pull request (sometimes things get lost in htsjdk).
From Matteodigg on 2016-03-18
Hi,
I have a question for you. I have a multi-sample vcf. I’d like to apply a filter where a variant PASS if at least one sample in that position has AD/DP > 0.2
Is there any way to do it?
Thanks for your help and your great job!
Matteo
From bbimber on 2016-03-18
One slightly awkward workaround might be to do something like:
1) in the first pass apply a genotype filter. this could work with the caveat that I’m not sure you can get AD (see my above post)
2) use setFilteredGenotypeToNoCall (or whatever the parameter is called)
3) in pass 2, apply a site filter based on # samples called, # not-called, etc.
Or as a one off, I think you could use a JEXL expression that hard codes all your samples names and strings together a big OR expression:
-select ‘vc.getGenotype(“Sample1”).getAD().0 / vc.getGenotype(“Sample1”).getDP() > 0.2 || vc.getGenotype(“Sample2”).getAD().0 / vc.getGenotype(“Sample2”).getDP() > 0.2 … etc. ‘
I didnt check my JEXL syntax on the above, so it probably needs to be touched up. Hopefully the general idea makes sense.
From Matteodigg on 2016-03-21
@bbimer,
thank you for your tip, but it doesn’t work… Using that JEXL expression, all my sites results as FILTER and I have problems with homo GT field like 1/1 or 2/2. Just to be clear, I’d like to tag a site in my vcf as PASS if at least one sample has AD/DP > 0.2 for a variant, otherwise, FILTER.
Any other suggestions?
Thanks,
Matteo
From bbimber on 2016-03-21
Matteo – sorry, that expression was just typed into the browser, not tested in any way. a couple guesses:
1) if you want the first alternate allele, should it be AD.1, not .0?
2) you might want to include checks like isHet or isNoCall. A homozygous ref animal will only have one value for AD, i think.
3) could you rewrite your ratio to be based off the reference depth? every genotype should have a value for AD.0.
4) I think JEXL is pretty tolerant to null values (for example, asking for AD.1 on a homozygote), but I am not all that familiar with how these are handled. I’d just be careful. In my experience GATK is pretty verbose about logging issues w/ JEXL filters.
good luck.
From bbimber on 2016-03-21
here’s the PR for the change to improve genotype filtering. this should allow proper filtering on AD and other genotype-level attributes:
https://github.com/samtools/htsjdk/pull/532
From Geraldine_VdAuwera on 2016-03-22
Got it, thanks @bbimber.
From bbimber on 2016-03-22
BTW – you asked for an example. The key differences are:
1) you can access the genotype object in genotype filters:
-G_filter “isHet && g.hasAd() && g.getAD().1 < 5”
in my opinion, this is much more useful than the above example for site filters that hard-codes specific sampleIDs. When you want to ask question like ‘filter sites where X% of genotypes have Y’, you could do one pass where you perform genotype filters and setFilteredGtToNoCall. after this, do a site filter on N_CALLED.
2) It fixes when I think was a bug in the original implementation, which should have omitted AD + PL from the JEXL map. This is because they were not included in Genotype.getExtendedAttributes().
From Geraldine_VdAuwera on 2016-03-22
I agree that makes a lot of sense.
I would recommend include this example explicitly in your pull request thread as it will help convince the gatekeepers of htsjdk that this is worthwhile, in case they don’t look up this thread.
From lindenb on 2016-03-22
FYI : there is now a javascript-based engine in picard FilterVcf . It’s, IMHO, much more powerful than JEXL . See :
https://github.com/samtools/htsjdk/blob/master/testdata/htsjdk/variant/variantFilter02.js
for an example.
From Geraldine_VdAuwera on 2016-03-23
lindenb Ah I saw that go through into htsjdk but must have missed the Picard FilterVcf update. That does look very powerful (thanks for contributing that!) but I do think we should take in
bbimber’s PR to shore up our JEXL offering in GATK, as for reasonably simple queries the usage is more straightforward compared to writing javascript.
From bbimber on 2016-04-27
Hello – the PR to HTSJDK was accepted, which should allow much better genotype filtering using GATK. My question is: how frequently does GATK update the HTSJDK version it uses? If I read github correctly, GATK is currently pegged to htsjdk version 1.141. These changes are in 2.2.2. Am I reading this wrong? Thanks in advance.
From Geraldine_VdAuwera on 2016-04-27
That’s correct. We typically update htsjdk when we cut a new release, which nowadays is about 3 times a year — so you’re in luck because the next release, GATK 3.6 is slated for mid-May. In fact there’s a PR that just went in for updating htsjdk to 2.2.2 so it should be available in the nightly builds within a few days. Assuming all tests pass.
From bbimber on 2016-04-27
good news, thanks.
From u0597274 on 2016-10-10
Just thought I’d request a simple demo to filter out homopolymers.
From Sheila on 2016-10-13
@u0597274
Hi,
There is an annotation call [HomopolymerRun](https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_HomopolymerRun.php) that you can use.
-Sheila
From everestial007 on 2017-01-02
I am getting the following error message: java -jar -Xmx16g /home/remington/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V 2ms01e.pHASER01.vcf -sn 2ms01e -o 2ms01e_only.pHASER01.vcf
`##### ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: For input string: ""
ERROR ------------------------------------------------------------------------------------------
`
It used to work perfectly. One of the error suggests it might be a bug, so reporting.
From Sheila on 2017-01-03
@everestial007
Hi,
Can you please submit a bug report so I can test this locally? Instructions are [here](https://software.broadinstitute.org/gatk/documentation/article?id=1894).
Thanks,
Sheila
From MBK on 2017-01-17
I am wondering whether -select parameter for SelectVariants can be used to filter variants based on the variables in the FILTER field. Something like -select “FILTER 'PASS' && FILTER ‘foo’ && FILTER == ‘bar’”. Tested on the latest version of the suite, but does not work. Any suggestions?
From Sheila on 2017-01-18
@MBK
Hi,
I think the best thing to do is use [`—setFilteredGtToNocall`](https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php#—setFilteredGtToNocall), then use either [`—minFilteredGenotypes`](https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php#—minFilteredGenotypes) or [`—minFractionFilteredGenotypes`](https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php#—minFractionFilteredGenotypes).
-Sheila
From frankdu on 2017-01-25
Hi,
I use JEXL in “-select” to run SelectVariants. `vc.getGenotype(“SAMPLE”).getAD().1 > 100` and `vc.getGenotype(“SAMPLE”).getDP() > 2000` work fine. But `vc.getGenotype(“SAMPLE”).getAD().1 / vc.getGenotype(“SAMPLE”).getDP() > 0.01` never works and always outputs a vcf only containing the vcf headers and column headers. I am very sure some variants should be selected by above expression. I am using GATK3.7. Any possible reason for this ? Your help is much appreciated.
thanks,
Frank
From Sheila on 2017-01-26
@frankdu
Hi Frank,
It looks like SelectVariants with that command will not work on sites that have missing DP or DP=0. The best thing to do is remove all sites that have missing DP or DP=0, then try running the command again.
Let us know if that works.
Thanks
Sheila
From frankdu on 2017-01-26
Hi Sheila,
I checked sample format field, found no missing DP and all DP values are non-zero. BTW, the input variant vcf was generated by running GATK `HaplotypeCaller`.
thanks,
Frank
From Sheila on 2017-01-30
@frankdu
Hi Frank
Hmm. How many sites are you running on? Can you post some records that you think should be selected but are not?
Thanks,
Sheila
From frankdu on 2017-02-06
Sorry Sheila. I was distracted by some other tasks. Here are two examples that should be selected but not:
First
>9 5000000 . G T 1747.19 . AC=1;AF=0.250;AN=4;BaseQRankSum=-0.452;ClippingRankSum=-0.633;DP=3598;FS=0.000;MLEAC=1;MLEAF=0.250;MQ=59.36;MQRankSum=2.648;QD=0.79;ReadPosRankSum=5.331;SOR=0.644 GT:AD:DP:GQ:PL 0/0:1378,3:1381:99:0,4079,48391 0/1:1967,245:2212:99:1777,0,63806
Second
>17 7000000 . C T 16949.19 . AC=1;AF=0.250;AN=4;BaseQRankSum=1.629;ClippingRankSum=2.326;DP=4149;FS=0.000;MLEAC=1;MLEAF=0.250;MQ=57.99;MQRankSum=2.073;QD=6.25;ReadPosRankSum=9.251;SOR=0.729 GT:AD:DP:GQ:PL 0/0:1415,1:1416:99:0,4207,49936 0/1:1995,690:2685:99:16979,0,63586
From Sheila on 2017-02-07
@frankdu
Hi Frank,
Can you try running your SelectVariants command on just those sites with `-L`? I may need you to submit a bug report.
Thanks
Sheila
From benjamincjackson on 2017-03-01
I’d like to use vc.isPolymorphicInSamples() to get rid of monomorphic sites in a VCF. By monomorphic I mean that there is only one allele in all of the samples I consider, and I do not care what this allele is (REF or (any single) ALT allele).
(Another way of expressing what I am trying to get at is that, for a site, if the array ‘AF’ is of length 1, then its value != 1.00)
I first thought that I could just do this:
java -jar GenomeAnalysisTK.jar \ -T SelectVariants \ -R ref.fa \ -V in.vcf.gz \ -o out.vcf.gz \ -select ‘vc.isPolymorphicInSamples()’
However, as far as I now understand from the method description (and partly from the code, but disclaimer: I cannot program in Java) this method would not select a site for removal if all the samples have genotypes 1/1 (or all have 2/2, or all have 3/3, etc). This is because the test that the method is implementing is whether or not the total number of ALT alleles among the samples == 0.
From the vc source code (lines 1020-1040 from https://sourceforge.net/p/picard/code/HEAD/tree/trunk/src/java/org/broadinstitute/variant/variantcontext/VariantContext.java#l1041):
/** * Genotype-specific functions — are the genotypes monomorphic w.r.t. to the alleles segregating at this * site? That is, is the number of alternate alleles among all fo the genotype 0? * * @return true if it's monomorphic */ public boolean isMonomorphicInSamples() { if ( monomorphic null ) monomorphic = ! isVariant() || (hasGenotypes() && getCalledChrCount(getReference()) == getCalledChrCount()); return monomorphic; } /** * Genotype-specific functions — are the genotypes polymorphic w.r.t. to the alleles segregating at this * site? That is, is the number of alternate alleles among all fo the genotype > 0? * * @return true if it’s polymorphic */ public boolean isPolymorphicInSamples() { return ! isMonomorphicInSamples(); }
The function isMonomorphicInSamples() actually tests for a site which is monomorphic for the REF allele, but in the case of a site which is monomorphic for an ALT allele in the samples, it would not consider the site monomorphic, right?
Then because isPolymorphicInSamples() is !isMonomorphicInSamples(), this means that it considers sites as polymorphic when they are not actually segregating in the sample.
It seems to me that these functions would be doing something closer to what their names suggest they should be doing if the ‘InSamples’ part of the name was excised.
I really hope that I haven’t fundamentally misunderstood something here.
Thanks very much for your time,
Ben
From Sheila on 2017-03-06
@benjamincjackson
Hi Ben,
I think you only want sites where all samples are het in the VCF. In this case, I think the best thing to do is loop through all the samples in your VCF using `‘vc.getGenotype(“sample”).isHet()’`.
-Sheila
From benjamincjackson on 2017-03-06
Hi Sheila,
I want(ed) to get sites where overall, there are two alleles amongst the samples. All the samples could be homozygous, as long as at least one sample was homozygous for an ALT allele.
I got what I wanted by using VariantsToTable to make a table of genotypes, performing the necessary logic in R, then writing a .list file of sites to exclude and giving that to SelectVariants.
I just thought that the isPolymorphicInSamples() method was misnamed a bit, but perhaps this is a philosophical disagreement stemming from my population-genetics-oriented point of view.
From ronlevine on 2017-03-20
@benjamincjackson
Ben-
`isMonomorphicInSamples` – the site has no alternate alleles (ALT = .) OR if there are genotypes, there are no samples with alternate alleles (for 2 samples, the genotypes are 0/0 0/0).
`isPolymorphicInSamples` = NOT `isMonomorphicInSamples` = the site has alternate alleles AND if there are genotypes, there are samples with an alternate allele.
The [Variant Call Format](http://www.internationalgenome.org/wiki/Analysis/vcf4.0/) definition of monomorphic is a site does not contain a variant (no alternate alleles and no genotypes with alternate alleles). You were expecting the population genetics definition, where all of the genotypes would have the same allele (for 1 ALT allele and 2 samples, 0/0 0/0 or 1/1 1/1).
If you want to make sure there is homozygous variant genotype and all of the genotypes are homozygous. You can use the following JEXL expression:
`vc.getHomVarCount() != 0 && vc.getHomVarCount() + vc.getHomRefCount() == vc.getCalledChrCount()`
-Ron
From everestial007 on 2017-05-31
HI,
I spent almost all day today working out several filtration procedures on multisample vcf. Generally filtration based on single sample (using JEXL queries) is explained but examples on the filtration of vcf sites/lines where GT (genotype) information is computed from several samples to run filtration/selectvariants are not quite detailed.
I just thought about writing those examples here so other people don't have to pull as much hair, Lol !.
#### Tutorials .... !!! # method to call the sites that are all homozygous Variants (1/1, 2/2 genotype) sample size in vcf is 6 java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() == 6' # outcome: no sites will have ambigous (./.) and/or homRef (0/0). All other homVar are allowed # in the above script if we want the site where at least 4 samples are homozygous ref or variants java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() >= 4' # outcome: atleast 4 samples are homVar, rest can be any genotype (./., 0/0, 0/1, etc.) # at least 1 sample is homVar (1/1, 2/2 or 3/3) java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() != 0' # all the samples that are homRef (0/0 genotypes) java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomRefCount() == 6' ## Note to remember: 6 chr = 3 samples (when samples are diploid) # sites that are all HomVar or nocalls (./.) java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() == vc.getCalledChrCount()/2' # outcome: no 0/0, 0/1 # Note: samples with GT = ./. is not counted as called chromosome ## Useful after samples are selected from multisample vcfs # sites where no genotypes were called for all the samples in the selected vcf # select vcf sites that are not called in any sample (i.e don't have any genotypes, only contain ./.) java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() == 0' # outcome = vcf sites (lines) that are all no calls (GT = ./.) on all samples are selected # samples are either homRef (0/0) or no call (./.) java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomRefCount() == vc.getCalledChrCount()/2' # outcome: vcf sites where all the samples are GT = ./. (no call) or GT = 0/0 # site where 6 chromosomes (exact) or 3 samples were called for diploid genome java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() == 6' # Note of Caution: 6 chr = 3 samples (when samples are diploid) # site where at least 3 samples (diploid) are called java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() >=6' # if I have 6-sample vcf java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getCalledChrCount() ==12' # 12 haploid chr = 6 samples (diploid) # outcome: sites that were called in all samples, there will be no GT = ./. # select vcf sites where no sample has hetVar (no GT = 0/1, 0/2), only GT = ./., 0/0, 1/1, 2/2 java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHetCount() == 0' # select vcf sites where no sample has hetVar (no Gt = 0/1, 0/2), and at least one sample is HomVar (GT = 1/1) java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHetCount() == 0 && vc.getHomVarCount() >= 1' # site where at least 1 sample is homVar (GT = 1/1 and/or 2/2 etc.) java -jar /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.Final.vcf -o My.filtered_01.vcf -select 'vc.getHomVarCount() >= 1'
*If any mistake gets introduced I won't be able to change it. So, please follow: *
https://sourceforge.net/p/picard/code/HEAD/tree/trunk/src/java/org/broadinstitute/variant/variantcontext/VariantContext.java#l55
Other updates may be followed @ https://www.biostars.org/p/255362/.
Thanks,
From everestial007 on 2017-05-31
ronlevine
Geraldine_VdAuwera :smiley:
I am having some trouble finding an appropriate vc JEXL queries. I want the vcf lines/sites that have same genotypes in all samples (including or excluding `no call GT = ./.`).
Any suggestions.
From bbimber on 2017-05-31
There’s probably some JEXL to do that; however, sometimes using VariantsToTable to make a TSV and filtering from there is easier. You could do that to find the whitelist of allowable sites. If you need this as VCF, you could write this as a list of intervals and pass that to SelectVariants using the -L argument. That’ll give you a VCF with only those sites.
From lindenb on 2017-05-31
using vcfilterjs http://lindenb.github.io/jvarkit/VCFFilterJS.html ( or picard FilterVcf)
java -jar dist/vcffilterjs.jar -e ‘function accept(vc) {for(var i=1;i
From everestial007 on 2017-05-31
@lindenb : I tried running this with picard. No succes, looks like I am messing up the script. Any more suggestions on how to use Picard with your expression.
From everestial007 on 2017-05-31
@lindenb : I was able to now run it using `Picard FilterVCF JS` but there didn’t pan out as expected. Thank you though !
From everestial007 on 2017-05-31
@lindenb : I was able to now run it using `Picard FilterVCF JS` but there didn’t pan out as expected. Thank you though !
From lindenb on 2017-05-31
@everestial007
works fine for me, (it’s not a hard filter)
$ cat NOT_SAME_GT.js function accept(vc) {for(var i=1;i
From mjtiv on 2017-06-16
Question about a JEXL command (filtering on depth of coverage across all samples for a variant)
“-filtername DP -filter “DP< 100”,
If I am understanding all the filtering information correctly this command will sum all the DP values for each sample for a variant and remove the variants with <100 total DP values .
Is this a correct assessment of how this command will work?
From Sheila on 2017-06-21
@mjtiv
Hi,
There are two depth annotations. Have a look at [this doc](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_annotator_Coverage.php) for the INFO annotation. Have a look at [this doc](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_annotator_DepthPerSampleHC.php) for the FORMAT annotation.
If you would like to filter on the INFO annotation, you would use [`—filterExpression`](https://software.broadinstitute.org/gatk/documentation/tooldocs/3.5-0/org_broadinstitute_gatk_tools_walkers_filters_VariantFiltration.php#—filterExpression). If you would like to filter on the FORMAT annotation, you would use [`—genotypeFilterExpression`](https://software.broadinstitute.org/gatk/documentation/tooldocs/3.5-0/org_broadinstitute_gatk_tools_walkers_filters_VariantFiltration.php#—genotypeFilterExpression).
-Sheila
From mjtiv on 2017-06-22
O.k., if I am understanding things correctly I need to change my code to: -filtername DP -filterExpression “DP < 100” to filter on a minimum coverage of a variant across all samples?
From Sheila on 2017-06-27
@mjtiv
Hi,
Yes, that is correct.
-Sheila
From MattB on 2017-07-03
Hi,
I’m tying to select out sites were the alt allele AD count is higher than 10, the problem is these are not always called as heterozygous, (I have high depth sequencing and I’m using the HC in classic single sample mode with `—output_mode EMIT_ALL_SITES`) e.g. in this example:
3 187446740 . T . Infinity . AN=2;DP=1095;MQ=60.00 GT:AD:DP 0/0:1095:1095 3 187446741 . C . Infinity . AN=2;DP=1117;MQ=60.00 GT:AD:DP 0/0:1117:1117 3 187446752 . A . Infinity . AN=2;DP=1297;MQ=60.00 GT:AD:DP 0/0:1297:1297 3 187446763 . C . Infinity . AN=2;DP=1494;MQ=60.00 GT:AD:DP 0/0:1494:1494 3 187451574 . C . Infinity . AN=2;DP=1493;MQ=60.00 GT:AD:DP 0/0:1493:1493 3 187451609 rs1880101 A G 39794.03 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.859;ClippingRankSum=0.000;DB;DP=995;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=24.56;ReadPosRankSum=0.406;SOR=8.234 GT:AD:DP:GQ:PL 1/1:1,988:989:99:39808,2949,0 4 1803279 . T G 0 . AC=0;AF=0.00;AN=2;BaseQRankSum=-6.652;ClippingRankSum=0.000;DP=245;ExcessHet=3.0103;FS=89.753;MLEAC=0;MLEAF=0.00;MQ=59.97;MQRankSum=0.000;ReadPosRankSum=-2.523;SOR=6.357 GT:AD:DP:GQ:PL 0/0:211,23:234:99:0,364,6739 4 1803307 rs2305183 T C 2486.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-5.049;ClippingRankSum=0.000;DB;DP=215;ExcessHet=3.0103;FS=1.110;MLEAC=1;MLEAF=0.500;MQ=59.97;MQRankSum=0.000;QD=11.95;ReadPosRankSum=-0.045;SOR=0.809 GT:AD:DP:GQ:PL 0/1:116,92:208:99:2494,0,3673 4 1803671 . C A 0 . AC=0;AF=0.00;AN=2;BaseQRankSum=-0.880;ClippingRankSum=0.000;DP=450;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;MQRankSum=0.000;ReadPosRankSum=-0.953;SOR=0.572 GT:AD:DP:GQ:PL 0/0:445,2:447:99:0,1272,15958 4 1803681 . T C 0 . AC=0;AF=0.00;AN=2;BaseQRankSum=-1.654;ClippingRankSum=0.000;DP=483;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;MQRankSum=0.000;ReadPosRankSum=-0.422;SOR=0.664 GT:AD:DP:GQ:PL 0/0:479,2:481:99:0,1408,18538 4 1803703 . A G 0 . AC=0;AF=0.00;AN=2;BaseQRankSum=-1.704;ClippingRankSum=0.000;DP=458;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;MQRankSum=0.000;ReadPosRankSum=0.299;SOR=0.497 GT:AD:DP:GQ:PL 0/0:454,2:456:99:0,1325,18095 4 1803704 rs2234909 T C 6676.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.605;ClippingRankSum=0.000;DB;DP=456;ExcessHet=3.0103;FS=1.753;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=14.71;ReadPosRankSum=0.324;SOR=0.849 GT:AD:DP:GQ:PL 0/1:220,234:454:99:6684,0,6366 4 1803824 rs2305184 C G 2030.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=8.083;ClippingRankSum=0.000;DB;DP=124;ExcessHet=3.0103;FS=6.128;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=16.51;ReadPosRankSum=0.180;SOR=0.096 GT:AD:DP:GQ:PL 0/1:62,61:123:99:2038,0,1766 4 1805296 rs3135883 G A 3876.03 . AC=2;AF=1.00;AN=2;DB;DP=110;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.22;SOR=9.401 GT:AD:DP:GQ:PL 1/1:0,109:109:99:3890,326,0
I have no interest in the lines were there is no AD count for anything other than the REF, however I’d to select lines 6,7,8,12,13 and 14 as these are a variety of het and hom calls all of which have an alt allele AD count greater than 10, I can’t however figure out how to select this using JXEL? Any suggestions? I’ve attached a cut-down VCF file of above.
From MattB on 2017-07-05
Just to update you, there are now both [vcfilterjs](http://lindenb.github.io/jvarkit/VCFFilterJS.html) (from Pierre Lindenbaum) and a bcftools (less optimal) solutions to this issue on [bioinformatics stack exchange](https://bioinformatics.stackexchange.com/questions/974/selecting-sites-from-vcf-which-have-an-alt-ad-10) (it’s in Beta so please give it a go!), in case anyone arriving here has the same issue.
From Sheila on 2017-07-24
@MattB
Hi Matt!
Sorry for the delay, but we were preparing for the workshop :smile:
If you look at the bottom of [this document](https://software.broadinstitute.org/gatk/documentation/article.php?id=1255) under “Using JEXL to evaluate arrays”, you should get what you are looking for. You just need to substitute in `-select ‘vc.getGenotype(“NA12878”).getAD().1 > 10’` for `-select ‘vc.getGenotype(“NA12878”).getAD().0 > 10’`.
-Sheila
P.S. Nice to meet you there and see you here :smiley:
From MattB on 2017-07-31
Hi Sheila, thanks, I always try to keep my pipeline pure GATK/Picard if possible! What’s the difference here in the `getAD().1 > 10` to the `getAD().0` is .0 only referring to reference calls? And would I need to run `getAD().2` to get the second alternative allele and so on?
Also my aim here is to recover potentially heterozygous sites which have not been called in a high-depth panel before feeding them into VEP to annotate them, since VEP will only annotate the “called” 0/1 or 1/1 genotypes is there a way of using said filter to alter / refine the genotype of 0/0 to 0/1 for such cases were AD of an alt allele is > 10, other than resorting to some kind of grep / sed / awk mashup? I Appreciate this is an odd / edge usage case.
P.S. Very nice to meet you all in sunny Edinburgh! :smiley:
From Sheila on 2017-08-10
@MattB
Hi Matt,
Yes, the .0 refers to the reference allele; the .1 refers to the first alternate allele; the .2 refers to the second alternate allele, and so on.
As for actually altering the genotype call, I don’t think there is any way to do that unless you manually alter them yourself (which we don’t recommend). You can try the [Genotype Refinement Workflow](https://gatkforums.broadinstitute.org/gatk/discussion/4723/genotype-refinement-workflow) to see if that changes any of the genotypes after additional information is available.
-Sheila
From manim on 2017-10-20
Hi,
I am having trouble with this command. I have a vcf file with 100’s of samples. And I want to apply sample level filter on allele frequency which is AD/DP. I have been trying it with this command but no luck. It marks all entries as filtered. When I simply tried with AD.0 >10, the command works as expected which tells me that it recognized AD.0 as number of reference reads. But somehow the division does not give me the right answer. Can you help?
java -jar GenomeAnalysisTK.jar -T VariantFiltration -R .ucsc_hg19.fa -V test.vcf -o test.vf.vcf -G_filter “(AD.0) / DP >0.8” —genotypeFilterName ‘lowAD’
From Sheila on 2017-10-24
@manim
Hi,
You need to use `vc.getGenotype(“sample”).getAD().0` instead of `AD.0`. Have a look at [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/7908/filtering-vcf-help) as well.
-Sheila
From manim on 2017-10-26
Hi @Sheila, Thank you. Let me try those.
From everestial007 on 2017-12-20
I am getting a bug error
on 3.8 but not on 3.7 for the same ditto code. I am trying to select the variants that are Hets in all called samples.
On 3.8:
$ java -jar /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V DNA_Samples.Passed_SNPs.vcf -o DNA_Samples.Passed.SNPs_allHetsRemoved.vcf -select 'vc.getHetCount() == vc.getCalledChrCount()/2' -invertSelect
! INFO 16:22:21,892 HelpFormatter - ---------------------------------------------------------------------------------- ! INFO 16:22:21,894 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50 ! INFO 16:22:21,894 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute ! INFO 16:22:21,894 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk ! INFO 16:22:21,894 HelpFormatter - [Wed Dec 20 16:22:21 EST 2017] Executing on Linux 4.4.0-101-generic amd64 ! INFO 16:22:21,895 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0151-8u151-b12-0ubuntu0.16.04.2-b12 ! INFO 16:22:21,897 HelpFormatter - Program Args: -T SelectVariants -R lyratagenome.fa -V DNASamples.PassedSNPs.vcf -o DNASamples.Passed.SNPsallHetsRemoved.vcf -select vc.getHetCount() == vc.getCalledChrCount()/2 -invertSelect ! INFO 16:22:21,899 HelpFormatter - Executing as everestial007@everestial007-Inspiron-3647 on Linux 4.4.0-101-generic amd64; OpenJDK 64-Bit Server VM 1.8.0151-8u151-b12-0ubuntu0.16.04.2-b12. ! INFO 16:22:21,900 HelpFormatter - Date/Time: 2017/12/20 16:22:21 ! INFO 16:22:21,900 HelpFormatter - ---------------------------------------------------------------------------------- ! INFO 16:22:21,900 HelpFormatter - ---------------------------------------------------------------------------------- ! ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties ! ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console... ! INFO 16:22:22,010 GenomeAnalysisEngine - Deflater: IntelDeflater ! INFO 16:22:22,010 GenomeAnalysisEngine - Inflater: IntelInflater ! INFO 16:22:22,010 GenomeAnalysisEngine - Strictness is SILENT ! INFO 16:22:22,292 GenomeAnalysisEngine - Downsampling Settings: Method: BYSAMPLE, Target Coverage: 1000 ! ##### ERROR -- ! ##### ERROR stack trace ! java.lang.RuntimeException: java.lang.reflect.InvocationTargetException ! at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:187) ! at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:165) ! at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.loadFromDisk(RMDTrackBuilder.java:375) ! at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.attemptToLockAndLoadIndexFromDisk(RMDTrackBuilder.java:359) ! at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:319) ! at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:264) ! at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:153) ! at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.(ReferenceOrderedDataSource.java:208) ! at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedDataSource.(ReferenceOrderedDataSource.java:88) ! at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:1071) ! at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:851) ! at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:294) ! at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123) ! at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256) ! at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158) ! at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108) ! Caused by: java.lang.reflect.InvocationTargetException ! at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ! at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ! at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ! at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ! at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:181) ! ... 15 more ! Caused by: java.lang.IllegalArgumentException: Illegal character in path at index 80: /media/everestial007/SeagateBackup4.0TB/RNAseqDataAnalyses/02AlignmentToRef - New/GVCFCallsonGenes-RunningTest01/02-JointGenotypingMySpF1/DNASamples.PassedSNPs.vcf ! at java.net.URI.create(URI.java:852) ! at htsjdk.samtools.util.IOUtil.getPath(IOUtil.java:984) ! at htsjdk.tribble.index.AbstractIndex.readHeader(AbstractIndex.java:305) ! at htsjdk.tribble.index.AbstractIndex.read(AbstractIndex.java:397) ! at htsjdk.tribble.index.linear.LinearIndex.(LinearIndex.java:119) ! ... 20 more ! Caused by: java.net.URISyntaxException: Illegal character in path at index 80: /media/everestial007/SeagateBackup4.0TB/RNAseqDataAnalyses/02AlignmentToRef - New/GVCFCallsonGenes-RunningTest01/02-JointGenotypingMySpF1/DNASamples.PassedSNPs.vcf ! at java.net.URI$Parser.fail(URI.java:2848) ! at java.net.URI$Parser.checkChars(URI.java:3021) ! at java.net.URI$Parser.parseHierarchical(URI.java:3105) ! at java.net.URI$Parser.parse(URI.java:3063) ! at java.net.URI.(URI.java:588) ! at java.net.URI.create(URI.java:850) ! ... 24 more ! ##### ERROR ------------------------------------------------------------------------------------------ ! ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.8-0-ge9d806836): ! ##### ERROR ! ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ! ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ! ##### ERROR Visit our website and forum for extensive documentation and answers to ! ##### ERROR commonly asked questions https://software.broadinstitute.org/gatk ! ##### ERROR ! ##### ERROR MESSAGE: java.lang.reflect.InvocationTargetException ! ##### ERROR ------------------------------------------------------------------------------------------
On 3.7
$ java -jar /home/evere stial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V DNA_Samples.Passed_SNPs.vcf -o DNA_Samples.Passed.SNPs_allHetsRemoved.vcf -select 'vc.getHetCount() == vc.getCalledChrCount()/2' -invertSelect
! INFO 16:23:11,550 HelpFormatter - -------------------------------------------------------------------------------- ! INFO 16:23:11,552 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18 ! INFO 16:23:11,552 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute ! INFO 16:23:11,553 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk ! INFO 16:23:11,553 HelpFormatter - [Wed Dec 20 16:23:11 EST 2017] Executing on Linux 4.4.0-101-generic amd64 ! INFO 16:23:11,553 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0151-8u151-b12-0ubuntu0.16.04.2-b12 ! INFO 16:23:11,557 HelpFormatter - Program Args: -T SelectVariants -R lyratagenome.fa -V DNASamples.PassedSNPs.vcf -o DNASamples.Passed.SNPsallHetsRemoved.vcf -select vc.getHetCount() == vc.getCalledChrCount()/2 -invertSelect ! INFO 16:23:11,560 HelpFormatter - Executing as everestial007@everestial007-Inspiron-3647 on Linux 4.4.0-101-generic amd64; OpenJDK 64-Bit Server VM 1.8.0151-8u151-b12-0ubuntu0.16.04.2-b12. ! INFO 16:23:11,560 HelpFormatter - Date/Time: 2017/12/20 16:23:11 ! INFO 16:23:11,561 HelpFormatter - -------------------------------------------------------------------------------- ! INFO 16:23:11,561 HelpFormatter - -------------------------------------------------------------------------------- ! INFO 16:23:11,593 GenomeAnalysisEngine - Strictness is SILENT ! INFO 16:23:11,939 GenomeAnalysisEngine - Downsampling Settings: Method: BYSAMPLE, Target Coverage: 1000 ! INFO 16:23:12,180 GenomeAnalysisEngine - Preparing for traversal ! INFO 16:23:12,182 GenomeAnalysisEngine - Done preparing for traversal ! INFO 16:23:12,183 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] ! INFO 16:23:12,183 ProgressMeter - | processed | time | per 1M | | total | remaining ! INFO 16:23:12,184 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime ! INFO 16:23:12,792 SelectVariants - 538 records processed. ! INFO 16:23:12,826 ProgressMeter - done 538.0 0.0 s 19.9 m 53.9% 0.0 s 0.0 s ! INFO 16:23:12,826 ProgressMeter - Total runtime 0.64 secs, 0.01 min, 0.00 hours ! ------------------------------------------------------------------------------------------ ! Done. There were no warn messages. ! ------------------------------------------------------------------------------------------
From shlee on 2017-12-22
Hi @everestial007,
I’m not sure if this is related but could you see if adding `-jdk_deflater` and -`jdk_inflater` to the erring v3.8 command allows the run to succeed? These parameters are discussed [here](https://gatkforums.broadinstitute.org/gatk/discussion/comment/43880#Comment_43880).
From lidia on 2018-03-20
Hello GATK team!
I am writing to ask if there is a way to filter all the HomVar SNPs from a file with multiple samples.
Let me explain a little more in detail. I am working with two strains of fish, each strain with 8 samples and I have a referece genome. But right now I need to select only the SNPs that are present in just one of the strains (basically I need to compare between strains, not to the reference). Therefore I need to filter out all the ones that compared to the Reference are different, but that between strains are the same.
Is this possible?
From everestial007 on 2018-03-20
@lidia
Did you look at the tutorial I posted above in this comment thread. It is also available via biostars: https://www.biostars.org/p/255362/
I am sure there is a way to your problem in there.
From shlee on 2018-03-20
Hi @lidia,
The tutorial shared by @everestial007 uses [JEXL expressions](https://software.broadinstitute.org/gatk/documentation/article.php?id=1255), which is one way to approach this.
Another approach to consider using [ChromosomeCounts](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_annotator_ChromosomeCounts.php) annotations. First, you should subset your callset with [SelectVariants](https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.1.2/org_broadinstitute_hellbender_tools_walkers_variantutils_SelectVariants.php) to retain only the samples per strain. Then, given the requirement that all samples in the strain-callset have hom-var sites you can consider using the expected value for `AC`, `AF` or `AN` for your cohort to filter.
From Phuong_Nguyen_1 on 2018-06-27
Hi,
Is there any way to put variables into JEXL -select command, e.g vc.getGenotype(“$i”)? because -select only read literal expression between single quote. That will be troublesome in case you would like to filter 100 vcf files with many different sample names.
Is there any better alternative than using JEXL to select variants, please share your advice.
Thank you very much.
From Sheila on 2018-06-28
@Phuong_Nguyen_1
Hi,
I responded [here](https://gatkforums.broadinstitute.org/gatk/discussion/comment/50002#Comment_50002). No need to post twice.
-Sheila
From Phuong_Nguyen_1 on 2018-07-03
>
Sheila said: >
Phuong_Nguyen_1
> Hi,
>
> I responded [here](https://gatkforums.broadinstitute.org/gatk/discussion/comment/50002#Comment_50002). No need to post twice.
>
> -Sheila
Hi, thank you for your reply, I am sorry for the duplicated post.
Phuong.
From Greg_Owens on 2018-07-13
I’m having an issue with selecting variants based on minor allele frequency while subsetting a smaller set. What I’d like to do is subset to a few samples, and retain anything that has variation (i.e. any AF > 0). My command is:
`gatk SelectVariants -sample-name SAMPLE1 -sample-name SAMPLE2 -V HanXRQChr01.tranche99.snp.vcf.gz -O tmp.vcf.gz —remove-unused-alternates -select ‘vc.isBiallelic() ? AF > 0.01 : vc.hasGenotypes() && vc.getCalledChrCount(vc.getAltAlleleWithHighestAlleleCount())/(1.0*vc.getCalledChrCount()) > 0.01’`
This runs for a short while and then outputs:
> A USER ERROR has occurred: Invalid JEXL expression detected for select-0
> See https://www.broadinstitute.org/gatk/guide/article?id=1255 for documentation on using JEXL in GATK
I believe the issue is the fact that I’m taking a dataset with many samples and selecting a few samples, so many alleles are being removed. If I retain all samples, it works fine. If I force it to only select biallelic sites, it works, but I want to keep the multiallelic sites as well.
From Sheila on 2018-07-16
@Greg_Owens
Hi,
>What I’d like to do is subset to a few samples, and retain anything that has variation
What happens if you just use `-sn sample1 -sn sample2 … —remove-unused-alternates`?
-Sheila
From FPBarthel on 2018-08-07
I have a field AO
in my VCF that I would like to filter on so I am using vc.getGenotype("sample_id").getAO() > 0
with VariantFiltration
, however that doesn't seem to work. I suspect getAO()
is not implemented. Is there some way to filter based on this?
```
```
UPDATE: Re-reading the page again I found the link to some Picard Java code here, and found that there exists a function getAttributeAsInt
and tried vc.getGenotype("sample_id").getAttributeAsInt("AO",0) > 0
which seems to work. Thought I'd leave this here for anyone else who may find this useful.
From Phuong_Nguyen_1 on 2018-10-11
Hi,
Here is snippet of my vcf file (from Mutect2)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR
NORMAL
chr1 985852 . C T . PASS DP=256;ECNT=1;NLOD=26.18;N_ART_LOD=-1.944e+00;POP_AF=2.500e-06;P_CONTAM=0.00;P_GERMLINE=-3.811e+01;REF_BASES=GAGGCCCCAGCGGCCTCCTGC;TLOD=92.21 GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB 0/1:130,35:0.220:84,16:46,19:31,33:193,224:42:26:0.202,0.192,0.212:8.067e-03,0.017,0.975 0/0:87,0:2.048e-04:67,0:20,0:32,0:222,0:0:0
I would like to filter AF attribute in FORMAT field but the JEXL expression was not working:
`-select ‘vc.getGenotype(“@TUMOR@”).getAF().floatValue() >= 0.200’`
error obtained: “A USER ERROR has occurred: Invalid JEXL expression detected for select-0“
Is there any expression available for this AF attribute? I guess “getAttributeAsInt” is for integer value only, how about float value ?
UPDATE: This expression work for me, maybe somebody could find it helpful:
-select ‘vc.getGenotype(“@TUMOR@”).getAttributeAsDouble(“AF”,0) >= 0.200’
However, there are several points I don’t understand:
- What is the “default value” of 0 in getAttributeAsDouble, what is its purpose ?
- Additionally, I am still confused regarding AF calculation in Mutect2, for example in my snippet: AF = 0.220 which is higher than Alt_AD/(Ref_AD+Alt_AD) , i.e 35/(130+35)=0.212. Could someone explain this different?
Thank you very much !
From shlee on 2018-10-11
Hi @Phuong_Nguyen_1,
If you search our forum with `getAttributeAsDouble`, you will come across a post that explains the second variable. The solution is provided by a GATK forum veteran @pdexheimer [here](https://gatkforums.broadinstitute.org/gatk/discussion/3435/filtering-on-allele-balance-by-sample-using-jexl). The second variable, in your case `0`, is the default value in case the annotation is nonexistant.
From sahiilseth on 2019-05-04
hi @Sheila the division of AD/DP does not work as expected unless we convert it to float. Could you update the post with converting to float. I had processed the samples and realized all the variants were being removed when using selectvariants :)
https://github.com/broadinstitute/gatk/issues/5916
```
java -jar GenomeAnalysisTK.jar -T SelectVariants \ -R reference.fasta \ -V variants.vcf \ -select vc.getGenotype(“NA12878”).getAD().1 / vc.getGenotype(“NA12878”).getDP() > 0.50
```
updated with `*1.0`
```
java -jar GenomeAnalysisTK.jar -T SelectVariants \ -R reference.fasta \ -V variants.vcf \ -select vc.getGenotype(“NA12878”).getAD().1*1.0 / vc.getGenotype(“NA12878”).getDP() > 0.50
```
From samlj on 2019-06-27
@sahiilseth I’ve been trying to get this to work as well, so your hint was very helpful, but I only want the genotypes which are heterozygous to be filtered out. When I use a compound expression like
```
java -Xmx2g -jar $GATK VariantFiltration \ -R $reference \ -V $VCF \ -O $output_VCF \ —genotype-filter-expression “(vc.getGenotype(\“subject1\”).isHet()) && ((vc.getGenotype(\“subject1\”.getAD().1*1.0)/(vc.getGenotype(\“subject1\”).getDP()) < 0.2)” \ —genotype-filter-name “alternateDepth“
```
any sample that is heterozygous at a locus where subject1 has a low alternate allele frequency (regardless of this sample’s alternate allele frequency) is also filtered:
``` #FORMAT subject1 subject2 subject3
GT:AD:DP:GQ 0/0:4,0:4:22 0/0:12,0:12:25 ./.:0,0:0:0
GT:AD:DP:GQ 0/0:1,0:1:52 ./.:1,0:1:88 0/1:1,1:2:50
GT:AD:DP:GQ 1/1:1,42:43:25 1/0:30,10:40:99 0/0:17,0:17:24
GT:AD:DP:FT:GQ 0/1:30,1:31:alternateDepth:99 ./.:0,0:0:PASS:0 ./.:1,1:2:PASS:3
GT:AD:DP:FT:GQ 0/1:30,1:31:alternateDepth:99 0/1:12,13:25:alternateDepth:0 0/1:1,299:300:alternateDepth:3
```
Is this the expected behavior? Is there a way to change it?
(Sorry for the formatting – my markdown isn’t working for some reason)