created by shlee
on 2018-07-03
Before using VariantFiltration, please read the entirety of the discussion in https://github.com/broadinstitute/gatk/issues/5362 that describes VariantFiltration's unintuitive behavior when processing compound expressions.
This tutorial illustrates how to filter on genotype, e.g. heterozygous genotype call. The steps apply to either single-sample or multi-sample callsets.
First, the genotype is annotated with a filter expression using VariantFiltration. Then, the filtered genotypes are made into no-call (./.
) genotypes with SelectVariants so that downstream tools may discount them.
We use example variant record FORMAT fields from trio.vcf
to illustrate.
GT:AD:DP:GQ:PL 0/1:17,15:32:99:399,0,439 0/1:11,12:23:99:291,0,292 1/1:0,30:30:90:948,90,0
If we want to filter heterozygous genotypes, we use VariantFiltration's --genotype-filter-expression "isHet == 1"
option. We can specify the annotation value for the tool to label the heterozygous genotypes with with the --genotype-filter-name
option. Here, this parameter's value is set to "isHetFilter"
.
gatk VariantFiltration \ -V trio.vcf \ -O trio_VF.vcf \ --genotype-filter-expression "isHet == 1" \ --genotype-filter-name "isHetFilter"
After filtering, in the resulting trio_VF.vcf
, our example record adds an FT
field and becomes:
GT:AD:DP:FT:GQ:PL 0/1:17,15:32:isHetFilter:99:399,0,439 0/1:11,12:23:isHetFilter:99:291,0,292 1/1:0,30:30:PASS:90:948,90,0
We see that HET (0/1
) genotype calls get a isHetFilter
in the FT
field and other genotype calls get a PASS
in the genotype field.
The VariantFiltration tool document lists the various options to filter on the FORMAT (aka genotype call) field:
We have put in convenience methods so that one can now filter out hets ("isHet == 1"), refs ("isHomRef == 1"), or homs ("isHomVar == 1"). Also available are expressions isCalled, isNoCall, isMixed, and isAvailable, in accordance with the methods of the Genotype object.
Running SelectVariants with --set-filtered-gt-to-nocall
will further transform the flagged genotypes with a null genotype call. This conversion is necessary because downstream tools do not parse the FORMAT-level filter field.
gatk SelectVariants \ -V trio_VF.vcf \ --set-filtered-gt-to-nocall \ -O trioGGVCF_VF_SV.vcf
The result is that the GT
genotypes of the isHetFiltered
genotype records become null or no call (./.
) as follows.
GT:AD:DP:FT:GQ:PL ./.:17,15:32:isHetFilter:99:399,0,439 ./.:11,12:23:isHetFilter:99:291,0,292 1/1:0,30:30:PASS:90:948,90,0
Updated on 2018-11-20
From vytautasrask on 2018-07-31
Very interesting tutorial. Where I can get trio.vcf to practice it?
From shlee on 2018-07-31
Hi @vytautasrask,
You can use trio data from the Workshop tutorial bundle. You can find a link to workshop materials on the Presentations page at https://software.broadinstitute.org/gatk/documentation/presentations.
From shlee on 2018-11-20
Before using VariantFiltration, please read the entirety of the discussion in https://github.com/broadinstitute/gatk/issues/5362 that describes VariantFiltration’s unintuitive behavior when processing compound expressions.
From avrajit on 2018-11-23
Hi I did this using gatk-4.0.11.0 VariantFiltration
-R GRCh38.p12.genome.fasta -V GENOTYPE.vcf
—filter-name “FS” —filter-expression “FS > 30.0” \
—filter-name “QD” —filter-expression “QD < 3.0” \ —filter-name “DP” —filter-expression “DP >= 20” \ -O filtered.vcf
But the terminal output sowing
ProgressMeter – Starting traversal
17:18:50.892 INFO ProgressMeter – Current Locus Elapsed Minutes Variants Processed Variants/Minute
17:18:52.328 INFO VariantFiltration – No variants filtered by: AllowAllVariantsVariantFilter
17:18:52.329 INFO ProgressMeter – KI270850.1:223180 0.0 75222 3142980.5
17:18:52.329 INFO ProgressMeter – Traversal complete. Processed 75222 total variants in 0.0 minutes.
17:18:52.474 INFO VariantFiltration – Shutting down engine
is that mean I am doing wrong somewhere?
is there any problem input?
for more information
this is generated form RNA-seq data
variant call done with haplotypeCaller
combined vcfs by CombineGVCFs (control and treated)
genotyping with GenotypeGVCFs
please help
From mernster on 2019-10-25
Hi, I have a multisample vcffile and would like to edit heterozygous individuals with a low or high allelic balance (AB).
If the AB is below 0.25, I would like to change the genotype to 1/1 (homozygous for alternate). If the AB is above 0.75, I would like to change the genotype to 0/0 (homozygous for reference allele).
Since I dont have the AB information for each genotype in my vcf, I am trying to use the genotype specific RO and DP instead. RO is the count of full observations of the reference haplotype and DP is the total number of reads.
I have been trying to use this command to tag the variants that I want to convert:
gatk VariantFiltration -R /ref.fasta -V input.vcf -O output.vcf —genotype-filter-expression ‘isHet == 1 && RO / DP < 0.25 ‘ —genotype-filter-name ‘lowAB’
gatk VariantFiltration -R /ref.fasta -V input.vcf -O output.vcf —genotype-filter-expression ‘isHet == 1 && RO / DP > 0.75 ‘ —genotype-filter-name ‘highAB’
But somehow this just adds an anotation to all heterozygous positions… Any ideas on what might be wrong?
Also, how could I proceed afterwords? is there any way to change the genotype fields of lowAB to homozygous for alternate allele and the highAB to homozygousfor reference allele?