created by shlee
on 2016-02-03
We use the term spanning deletion or overlapping deletion to refer to a deletion that spans a position of interest.
The presence of a spanning deletion affects how we can represent genotypes at any site(s) that it spans for those samples that carry the deletion, whether in heterozygous or homozygous variant form. Page 8, item 5 of the [VCF v4.3 specification](https://samtools.github.io/hts-specs/VCFv4.3.pdf) reserves the `*` allele to reference overlapping deletions. This is not to be confused with the bracketed asterisk `<*>` used to denote symbolic alternate alleles.
——
Here we illustrate with four human samples. Bob and Lian each have a heterozygous `A` to `T` single nucleotide polymorphism at position 20, our position of interest. Kyra has a 9 bp deletion from position 15 to 23 on both homologous chromosomes that extends across position 20. Lian and Omar are each heterozygous for the same 9 bp deletion. Omar’s and Bob’s other allele is the reference `A`.
What are the genotypes for each individual at position 20? For Bob, the reference A and variant T alleles are clearly present for a genotype of `A/T`.
What about Lian? Lian has a variant T allele plus a 9 bp deletion overlapping position 20. To notate the deletion as we do single nucleotide deletions is technically inaccurate. We need a placeholder notation to signify absent sequence that extends beyond the position of interest and that is listed for an earlier position, in our case position 14. The solution is to use a star or asterisk `*` at position 20 to refer to the spanning deletion. Using this convention, Lian’s genotype is `T/*`.
At the sample level, Kyra and Omar would not have records for position 20. However, we are comparing multiple samples, so we indicate the spanning deletion at position 20 with `*`. Omar’s genotype is `A/*` and Kyra’s is `*/*`.
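The genotype reasoning for the four samples can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not GATK's logic: the `Haplotype` tuple and the `allele_at` and `genotype_at` helpers are hypothetical names, and the sketch models only position 20 and the deletion interval 15 to 23 described in the text.

```python
from typing import NamedTuple, Optional, Tuple

class Haplotype(NamedTuple):
    """One chromosome copy (hypothetical model for this illustration)."""
    snp_base: Optional[str] = None               # variant base at the site, if any
    deletion: Optional[Tuple[int, int]] = None   # inclusive (start, end) of deleted bases

def allele_at(hap: Haplotype, pos: int, ref_base: str) -> str:
    """Allele a haplotype contributes at pos; '*' marks a spanning deletion."""
    if hap.deletion is not None and hap.deletion[0] <= pos <= hap.deletion[1]:
        return "*"
    return hap.snp_base or ref_base

def genotype_at(haps, pos: int, ref_base: str) -> str:
    """Join the two haplotype alleles into an unphased genotype string."""
    return "/".join(allele_at(h, pos, ref_base) for h in haps)

DEL = (15, 23)  # the 9 bp deletion from the example
samples = {
    "Bob":  (Haplotype(), Haplotype(snp_base="T")),
    "Kyra": (Haplotype(deletion=DEL), Haplotype(deletion=DEL)),
    "Lian": (Haplotype(snp_base="T"), Haplotype(deletion=DEL)),
    "Omar": (Haplotype(), Haplotype(deletion=DEL)),
}
genotypes = {name: genotype_at(haps, 20, "A") for name, haps in samples.items()}
for name, gt in genotypes.items():
    print(name, gt)
```

The printed genotypes match those derived in the text: `A/T` for Bob, `*/*` for Kyra, `T/*` for Lian and `A/*` for Omar.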
——
In the VCF, depending on the format used by tools, positions equivalent to our example position 20 may or may not be listed. If listed, such as in the first example VCF shown, the spanning deletion is noted with the asterisk `*` under the `ALT` column. The spanning deletion is then referred to in the genotype `GT` for Kyra, Lian and Omar. Alternatively, a VCF may avoid referencing the spanning deletion altogether by representing the spanned variant and the deletion together in a single record that starts at the deletion’s position. This is shown in the second example VCF at position 14.
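To make the first representation concrete, here is a minimal Python sketch that decodes the `GT` indices of one VCF data line into allele strings. The record is hypothetical: it is built from the description in the text (REF `A`, ALT `T,*`, sample order Bob, Kyra, Lian, Omar) rather than copied from the example VCF, and `decode_genotypes` is an illustrative helper, not part of any GATK tool.

```python
from typing import Dict, List

def decode_genotypes(vcf_line: str, sample_names: List[str]) -> Dict[str, str]:
    """Translate each sample's GT indices into allele strings for one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    alleles = [fields[3]] + fields[4].split(",")   # index 0 is REF, 1..n are ALT
    gt_index = fields[8].split(":").index("GT")    # position of GT within FORMAT
    decoded = {}
    for name, sample_field in zip(sample_names, fields[9:]):
        gt = sample_field.split(":")[gt_index]
        sep = "|" if "|" in gt else "/"            # keep phased vs unphased separator
        decoded[name] = sep.join(
            "." if idx == "." else alleles[int(idx)] for idx in gt.split(sep)
        )
    return decoded

# Hypothetical record for position 20 (assumed column values, see lead-in).
record = "\t".join(
    ["20", "20", ".", "A", "T,*", ".", ".", ".", "GT", "0/1", "2/2", "1/2", "0/2"]
)
names = ["Bob", "Kyra", "Lian", "Omar"]
decoded = decode_genotypes(record, names)
has_star = "*" in record.split("\t")[4].split(",")
print(has_star, decoded)
```

Here the index `2` in `GT` points at the `*` entry in `ALT`, which is how the genotype refers to the spanning deletion.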
Updated on 2017-06-01
From abor on 2016-09-09
Is there a way to get the format of the second example by using GATK 3.6?
From Geraldine_VdAuwera on 2016-09-09
No, that’s not possible, sorry.
From tc13 on 2016-09-23
What’s the best way to remove spanning deletions from a vcf?
I tried (unsuccessfully): `SelectVariants -select "ALT == '*'" -invertSelect`
From Sheila on 2016-09-30
@tc13
Hi,
In [this thread](http://gatkforums.broadinstitute.org/gatk/discussion/8211/alternate-alleles-in-vcf-are-more-than-1-base#latest) you managed to get SNPs only. However, are you wanting to keep indels this time too? I think `-selectTypeToExclude SYMBOLIC` should do the trick. Let us know if it does not.
-Sheila
From everestial007 on 2017-06-05
@Geraldine_VdAuwera
@shlee: Thank you for the link and a new method for representing complex variants.
Since I am working with phasing, the use of `*` is going to complicate building the alternate genome. There are several places in our two diverged population samples where these `*` alleles are fixed for one population vs. another, so these variants might be highly useful.
You said in an earlier post that there is no way to revert to the latter representation of the variants, but would it be possible to get a simple representation of the variants if I split the multi-sample VCF into several single-sample VCFs and convert `*` to represent just the alleles in that sample? This would help to make a more accurate alternate genome for that individual. I already split the VCF just to get a bi-allelic representation, but the `*` alleles are still there. Is there anyone I can talk to, to get some hints?
From Sheila on 2017-06-07
@everestial007
Hi,
What is the command you ran to split the VCFs?
-Sheila
From everestial007 on 2017-06-08
@Sheila
I simply used GATK `SelectVariants` with the `-sn` option to split the VCF by sample.
From Marta on 2017-06-08
Hi everyone
I apologize for the very naive question, but I received some VCF files from our collaborators and I would like to annotate them using SnpEff. This is my first time handling a VCFv4.2 file, and the program can’t read the `*`. This is an example:
chr1 6529186 . TCC TC,T,* 358931 PASS AC=4,1,163;AF=0.005208,0.001302,0.212;AN=768;BaseQRankSum=0.42;ClippingRankSum=0.624;DP=103800;ExcessHet=84.2774;FS=0;MLEAC=4,1,165;MLEAF=0.005208,0.001302,0.215;MQ=11.92;MQRankSum=-0.035;QD=5.97;ReadPosRankSum=-0.086;SOR=0.675 GT:AD:DP:GQ:PL 0/3:274,0,0,29:308:99:322,1144,12459,1144,12459,12459,0,11319,11319,11239
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: WARNING: Unkown IUB code for SNP '*'
What can I do? Do you think a more recent version of SnpEff could solve my problem? Is there an alternative way to get around this issue without losing any information? Thank you
From everestial007 on 2017-06-08
@Marta
Do you want to keep the `GT = */*` and translate it into actual nucleotide codes, or are you fine with removing them? For the former issue, see the question I posted just 2 days ago. If you are fine with removing the `*/*` completely, just remove lines with `*/*` for that sample or for all the samples. Check out the tutorial I posted here: http://gatkforums.broadinstitute.org/gatk/discussion/comment/39096#Comment_39096
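The "just remove lines with `*/*`" option can be sketched in Python as below. This is a rough illustration, not a GATK or SnpEff feature: `is_star_only` and `drop_star_only` are hypothetical helper names, the input lines are invented, and the sketch discards information instead of translating it. It also leaves INFO fields such as `AC` and `AN` untouched.

```python
from typing import Iterable, Iterator

def is_star_only(vcf_line: str, sample_index: int = 0) -> bool:
    """True if every called allele of the chosen sample is the '*' allele."""
    fields = vcf_line.rstrip("\n").split("\t")
    alleles = [fields[3]] + fields[4].split(",")        # REF followed by ALT alleles
    gt_pos = fields[8].split(":").index("GT")
    gt = fields[9 + sample_index].split(":")[gt_pos]
    called = [a for a in gt.replace("|", "/").split("/") if a != "."]
    return bool(called) and all(alleles[int(a)] == "*" for a in called)

def drop_star_only(lines: Iterable[str], sample_index: int = 0) -> Iterator[str]:
    """Yield header lines and every record whose genotype is not purely '*'."""
    for line in lines:
        if line.startswith("#") or not is_star_only(line, sample_index):
            yield line

# Invented single-sample lines for demonstration only.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tKyra"
hom_star = "20\t20\t.\tA\t*\t.\t.\t.\tGT\t1/1"
het_star = "20\t20\t.\tA\t*\t.\t.\t.\tGT\t0/1"
no_call = "20\t25\t.\tC\tG\t.\t.\t.\tGT\t./."
kept = list(drop_star_only([header, hom_star, het_star, no_call]))
print(len(kept))
```

Only the `*/*` record is dropped; heterozygous genotypes such as `0/3` in the record above still carry a `*` allele and are kept.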
From Sheila on 2017-06-09
@everestial007
Hi,
Can you try adding [`--removeUnusedAlternates`](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php#--removeUnusedAlternates) and [`--excludeNonVariants`](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php#--excludeNonVariants) to your command?
-Sheila
From everestial007 on 2017-06-09
Hi @Sheila
My actual question was whether there is a way to convert the `*` to real nucleotide codes when splitting the VCFs. Sorry if the question was confusing.
Thanks,
From Sheila on 2017-06-12
@Marta
Hi,
I think this may be a question for the SnpEff developers, as the * allele is indeed supported by the [VCF spec](https://samtools.github.io/hts-specs/VCFv4.2.pdf). You can also try out [Oncotator](http://gatkforums.broadinstitute.org/gatk/categories/oncotator) which is supported by our team.
-Sheila
From Sheila on 2017-06-13
@everestial007
Hi,
Ah, I see. There is no way to do that with GATK tools. Have a look at my answer above. Since the VCF with * allele is in accordance with the VCF spec, you will need to write your own tools to convert * allele to something else.
-Sheila
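As a starting point for such a custom tool, the sketch below looks up which upstream deletion record covers a site that carries a `*` allele. It assumes simple left-anchored deletions (the first REF base is the anchor), uses "ALT shorter than REF" as an approximate deletion test, and takes records as plain tuples; `find_spanning_deletions` and the example records are hypothetical, with made-up reference bases.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, int, str, List[str]]  # (chrom, pos, ref, alts)

def find_spanning_deletions(
    records: Iterable[Record], chrom: str, pos: int
) -> List[Tuple[int, str, str]]:
    """Return (pos, ref, alt) for upstream deletions whose REF span covers pos."""
    hits = []
    for r_chrom, r_pos, ref, alts in records:
        if r_chrom != chrom or r_pos >= pos:
            continue                       # deletion must start before the site
        last_ref_pos = r_pos + len(ref) - 1
        for alt in alts:
            if alt in ("*", ".") or alt.startswith("<"):
                continue                   # skip star, missing and symbolic alleles
            if len(alt) < len(ref) and last_ref_pos >= pos:
                hits.append((r_pos, ref, alt))
    return hits

# Hypothetical records: a deletion anchored at 14 that removes bases 15 to 23,
# the site of interest at 20, and an unrelated downstream deletion.
records = [
    ("20", 14, "GCCCCCACCC", ["G"]),
    ("20", 20, "A", ["T", "*"]),
    ("20", 30, "AT", ["A"]),
]
hits = find_spanning_deletions(records, "20", 20)
print(hits)
```

A converter would still need to decide how to merge the deletion with any other allele at the site and how to update per-sample fields, which this sketch does not attempt.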
From everestial007 on 2017-06-13
@Sheila
Hopefully an updated PyVCF module will have some option to mine `gt_bases` for `GT = */*` sometime soon. I will post a solution if I find one. Thanks
@Geraldine_VdAuwera
@Sheila
I wanted to ask whether it is possible to create a personal tutorial section on GATK. Since my data analyses and pipeline are mostly dependent on GATK, I thought it would be wise and helpful to put some methods I have explored here. Some important things could be 1) mining variants (single-sample vs. multi-sample) in VCF, which I posted last week but can’t modify now, and 2) phasing in F1 hybrids.
Thanks,
From Geraldine_VdAuwera on 2017-06-14
@everestial007 Do you mean a section where community users like yourself could post and maintain tutorials to supplement the materials we provide? If so that’s an intriguing idea. I’m not opposed to it in principle but would like to give some thought to how we would organize and curate it.
From everestial007 on 2017-06-15
@Geraldine_VdAuwera
Yes, that’s what I meant. Let me know.
Thanks,
From KlausNZ on 2017-06-22
Hi Sheila and Geraldine,
In Kyra’s single-sample vcf, `20 A *` may well be valid vcf, but `14 CCCCCACCC G`
1) would be far more informative and concise (which is also a requirement of the vcf spec), and
2) `SelectVariants --removeUnusedAlternates` does recompute POS and REF when producing a single-sample vcf for analogous homozygous genotypes from a jointly called multi-sample file (eg `22 19188993 GCGGTCTCC GCGGTT,GAGA` becomes `22 19188998 CTCC T` when selecting the 1/1 sample), and
3) the current behaviour creates different representations of the same variant when applying the same tools (HaplotypeCaller and GenotypeGVCFs) in single vs joint (followed by single-sample selection) modes.
So the situation seems somewhat inconsistent, and its worst consequence may be resistance to joint calling. Could you consider enabling an option in SelectVariants to change the output for single-sample homozygotes? Everywhere else of course, `*` makes great sense despite the complications….
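The POS and REF recomputation in point 2 can be reproduced with a short allele-trimming sketch. This is an illustration of the arithmetic only, not GATK's implementation; `trim_alleles` is a hypothetical helper that strips shared trailing and then shared leading bases while keeping at least one base in each allele.

```python
from typing import Tuple

def trim_alleles(pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Trim bases shared by REF and ALT, shifting POS for each leading base removed."""
    # Trim shared trailing bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim shared leading bases and advance POS accordingly.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

# Values taken from the example in point 2, keeping only the first ALT allele.
trimmed = trim_alleles(19188993, "GCGGTCTCC", "GCGGTT")
print(trimmed)
```

Five leading bases (`GCGGT`) are shared, so POS moves from 19188993 to 19188998 and the alleles shrink to `CTCC` and `T`, matching the record quoted above.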
From Geraldine_VdAuwera on 2017-06-24
Hi @KlausNZ, let me run this one by our devs since spanning deletions are a pretty contentious topic.
From KlausNZ on 2017-06-25
Thanks Geraldine!
From Geraldine_VdAuwera on 2017-07-05
Hi @KlausNZ, our devs agree that it would make sense to enable getting rid of single-sample `*/*` records, and generally to enable selection/removal based on `*` alleles (which is currently not possible either). I’ll put in a ticket to get that done in GATK4; be aware that it probably won’t be backported to GATK3 as we’re very close to putting a definitive lid on the 3.x series.
From KlausNZ on 2017-07-18
Hi Geraldine, that’s great news! Many thanks for considering this. It will help greatly I predict. No worries re 3.x, we’re keen to move into the world of 4.x
From everestial007 on 2017-12-13
@Geraldine_VdAuwera :
Can you please give an update on whether the problem with the `*` allele is fixed? And if so, how should I proceed with removing it or selecting (and converting) it?
From Geraldine_VdAuwera on 2017-12-13
@everestial007 That work has not yet been done, as we’ve been prioritizing work that is critical for the GATK 4.0 release (which includes several major new workflows). Sorry for the bad news. I can’t yet give you a timeline for when this will be addressed — starting Jan 9 it’s a new world, and we’re going to be reexamining some of our stack of priorities based on people’s feedback at that time.
From hydkat on 2018-02-13
Hi,
I created a dummy VCF file which contains sample records with the presence of a `*` allele in the ALT column…
Can anyone take a look at the ALT notations and sample GT values given in the file and tell if the scenarios described are valid?
Thanks,
Karthik
From Sheila on 2018-02-20
@hydkat
Hi Karthik,
You can use [ValidateVariants](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_variantutils_ValidateVariants.php).
-Sheila
From CNBers on 2018-09-12
> Sheila said:
> @everestial007
> Hi,
>
> Can you try adding [`--removeUnusedAlternates`](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php#--removeUnusedAlternates) and [`--excludeNonVariants`](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php#--excludeNonVariants) to your command?
>
> -Sheila
@Sheila I am using GATK 3.8 to select SNPs, but I want to remove `*` alleles. How can I do this?
Best,
Ningbo
From shlee on 2018-09-12
Hi @CNBers,
Removal of unused `*` alleles is a new feature that was just incorporated into SelectVariants. This was merged into the gatk repository’s master branch two weeks ago ([PR](https://github.com/broadinstitute/gatk/pull/5129)). Can you wait for the next release of GATK? If you are in a hurry, it is possible to build the GATK tool bundle from the master branch following instructions on [this page](https://github.com/broadinstitute/gatk).
From CNBers on 2018-10-24
> shlee said:
> Hi @CNBers,
>
> Removal of unused `*` alleles is a new feature that was just incorporated into SelectVariants. This was merged into the gatk repository’s master branch two weeks ago. Can you wait for the next release of GATK? If you are in a hurry, it is possible to build the GATK tool bundle from the master branch following instructions on.
Dear @shlee
When will the new GATK be released?
From shlee on 2018-10-24
Hi @CNBers,
There have been a number of releases since September 12. The latest as of today is v4.0.11.0 and you can find it [here](https://github.com/broadinstitute/gatk/releases).
From Fick1995 on 2019-01-10
This is a good one :)
From afzm on 2019-04-16
Is there now a way to get the format of the second example by using GATK4? If not, when I normalize the VCF file (with pre.py), the `*` alleles get removed and I lose all that information, and I do not know what impact that loss would have on the phasing step. If there is no way, would you have any suggestions for how to do it? Thank you