created by delangel
on 2012-07-26
In general most GATK tools don't care about ploidy. The major exception is, of course, at the variant calling step: the variant callers need to know what ploidy is assumed for a given sample in order to perform the appropriate calculations.
As of version 3.3, the HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the -ploidy
argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the -ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the -ploidy argument.
For earlier versions (all the way to 2.0) the fallback option is UnifiedGenotyper, which also accepts the -ploidy
argument.
For normal organism ploidy, you just set the -ploidy
argument to the desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e. (Ploidy per individual) * (Individuals in pool)
.
Several variant annotations are not appropriate for use with non-diploid cases. In particular, InbreedingCoeff will not be annotated on non-diploid calls. Annotations that do work and are supported in non-diploid use cases are the following: QUAL
, QD
, SB
, FS
, AC
, AF
, and Genotype annotations such as PL
, AD
, GT
, etc.
You should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.
Updated on 2014-10-24
From marcustuke on 2012-07-26
As GATK2 can handle Mitochondrial DNA, is there a recommended ploidy setting for human Mitochondria? I understand that mtDNA can vary dramatically in how many copies are present in a cell, but is there some sort of consensus value? (e.g. some sort of function of mean coverage)
Thank-you very much!
From delangel on 2012-07-26
We’ve experimented with 50 to 100 but we make no optimality claims on that – probably a better number would be the ratio of (mean coverage in the MT contig) / (mean coverage in somatic chromosomes)
From mike_lyons on 2012-08-29
@delangel are there any other recommended settings for MT with GATK?
From sahiilseth on 2012-09-20
@delangel
How does UG use this ploidy information for calling variants in MT? For SNPs at any position we dont expect more than 4 alleles (ATGC).
In our low-pass data we have 5-7X coverage overall, and ~700X in case of mitochondria.
From delangel on 2012-09-21
It’s internal machinery needs to know the organism ploidy (i.e. number of chromosomes inside) to work well (btw number of possible different alleles is different than ploidy). Given your coverage I’d start with -ploidy 100 or so
From kjclowers on 2014-04-16
What if I want to do multi-sample SNP calling using the unified genotyper on a mixture of haploid and diploid organisms? I have many yeast genomes where most are haploid, but a few are diploid. Will I have to call variants on the two groups separately?
From Sheila on 2014-04-16
@kjclowers
Hi,
You will need to run the UnifiedGenotyper separately for the haploid and diploid genomes. You have to specify which ploidy the UG should expect for each group in your command line, so unfortunately you cannot run a mixture of ploidies.
-Sheila
From tommycarstensen on 2015-01-13
@delangel Do you throw your MT (and Y and non-PAR X) variants into VQSR along with your autosomal variants? I’m worried this will bias depth dependent annotations. I’m more keen on doing separate filtering for these variants (at least MT and Y), but I would probably have to do hard filtering, since the number of variants is probably insufficient for VQSR and because not all SNP truth/training sets contain all chromosomes:
hapmap_3.3.b37.vcf.gz 1-22,X,Y,MT 1000G_omni2.5.b37.vcf.gz 1-22,X,Y 1000G_phase1.snps.high_confidence.b37.vcf.gz 1-22,X dbsnp_138.b37.vcf.gz 1-22,X,Y
I would be very happy to get your insight on this, because the supplementary material to projects such as 1000G phase1 and GoNL is a bit scattered in terms of variant calling and filtering of mtDNA.
From Geraldine_VdAuwera on 2015-01-14
Hi Tommy @tommycarstensen
I’m not aware that we do any special handling of the non-autosomal components in the production pipeline — I think the assumption is that there are few enough of them to not affect the rest. But it would be interesting to see a formal comparison. I sure can’t imagine that you could get VQSR to run on them alone.
PS: FYI Guillermo (@delangel) has moved on to another job so I don’t know if he still gets GATK forum emails — best not count on it.
From tommycarstensen on 2015-01-15
@Geraldine_VdAuwera thanks for this. If I do any comparisons, then you will be the second to know. I’m still making changes to my analysis plan. I’m worried that Y and non-PAR X will have true positives filtered out by VQSR, because they are haploid in males and VQSR uses a few depth related annotations. I’ll update this thread in time.
From tommycarstensen on 2015-01-16
@Geraldine_VdAuwera Just as an aside. The memory use of UG3.3 seems to explode and it seems to stall or only walk slowly, when I run it with -ploidy 1 on chromY and the non-PARs of chromX for male samples. I’m switching to diploidy for the whole genome for samples of both sexes. I am however happy to provide command line etc., if you are eager to troubleshoot. Apologies for the many posts over the past few weeks and months. I’ve been testing out a lot of things.
From Geraldine_VdAuwera on 2015-01-16
@tommycarstensen To be honest we have zero resources available for UG troubleshooting/improvements — I’m afraid you’re on your own on this one.
No worries, your questions have helped us nail down a few issues.
From eflannery on 2015-01-17
Hi Geraldine,
I have a question about using UG or haplotype caller with a haploid organism, but which I have more than one clone in a sample. We generally use UG with ploidy set to two (default) and assume het variants are mixed reads from different clones. My worry is UG expects a certain number of hets to be called. Is this true? I’ve thought about setting ploidy to e.g. 4 and then saying we can only detect clones present at at least 25% of the sample. What are your thoughts. If you run UG with ploidy=1, will it allow for het calling? Should I try Mutec instead? Thanks!
From Geraldine_VdAuwera on 2015-01-19
Hi @eflannery,
There is no expectation of numbers of hets to be called; UG calls each site independently of any others. But there is an expectation of allelic ratios that depends directly on the ploidy setting. If you use ploidy = 1, the caller will make haploid calls only (no hets possible) so in effect it will pick the major allele present at highest frequency in your population. So unless that’s what you want, you should use ploidy > 1. The higher the ploidy, the more likely you are to pick up additional alleles — but it wll also increase your false positives. So it’s up to you to choose where to set the cutoff.
MuTect has very different internal logic since it does not try to pick alleles based on ploidy and/or allele ratio expectations. It is a lot more permissive re: allele ratios. But it is designed to work with paired tumor/normal samples, and while it can be run on non-paired samples, that’s not the supported usage. You can certainly try it, but there is no guarantee that it will do the right thing for your use case.
From u0597274 on 2016-11-28
Is there any way to force HaplotypeCaller to do “naive” variant calling? i.e. call every possible variant. It seems excessive memory usage is inevitable when using high ploidy parameters – Freebayes, for example, just has an option to call everything.
Also, does MuTect not lack support for non-diploid genomes?
From Geraldine_VdAuwera on 2016-11-28
@u0597274 What does it mean to “call everything”? The performance cost of using high ploidies is due to the need to calculate genotype likelihoods for a large number of possible combinations. If you’re calling every possible variant, you still need to calculate all the possible genotypes — or am I misunderstanding what you’re saying?
Note that in the upcoming 3.7 version we’ve put in some logic to cull potential alleles that are actually very unlikely to be real, in order to reduce the amount of genotype combinations we have to calculate, which buys us some performance improvements.
From u0597274 on 2016-12-20
@Geraldine_VdAuwera – my apologies, I just saw your reply! It seems I had not configured my forum notification settings properly.
Forgive me, I’m not much of a bioinformatics expert, so my phrasing isn’t perfect, but I was basically asking if there is any way to force HaplotypeCaller to essentially behave like Mutect. My goal is to perform somatic variant calling on the mouse mitochondrial genome with RNA-Seq data. I’d be happy to turn off calculating genotype likelihoods – I just want to take advantage of the HaploTypeCaller’s realignment capabilities for detecting somatic indels.
On a side note, it seems GATK4 has gotten rid of IndelRealignment…
From Geraldine_VdAuwera on 2016-12-20
@u0597274 I see, thanks for clarifying. I can’t recommend using HaplotypeCaller for that purpose, the likelihood model is just not appropriate. You’re better off using MuTect2 in artifact detection mode.
You’re correct that Indel Realignment tools are not available at all in GATK4. This is because the GATK4 variant callers do their own realignment internally, so prior realignment is no longer necessary.
From jingjin0322 on 2018-02-05
Hi @Geraldine_VdAuwera , I just had a question related to this post. Hope you could help me out.
I work on a diploid organism, and I have ddRAD seq data from 104 samples of the organism. I added different ID, SM, LB to the bam files for each of the 104 samples and generated gvcf files. I wonder when I use CombineGVCFs to call the variants, should I specify —ploidy 208 ? I guess I am not quite sure how a sample pool is defined?
Also, 16 published genomes of the organism I’m working on are available to me. I believe the published genome is the haplotype of the organism. So if I wanted to combine the gvcf files from the 16 published genomes using ComebineGVCFs, should I specify —ploidy 1 ?
My last question is I’d like to incorporate the published genome into my ddRADseq data to call variants across my samples and the published genomes? What will be the best way to do it? Should I generate two vcf files, one for published genomes and one for my samples, and then combine them?
Really hope you could give some suggestions. Thanks in advance!
From Geraldine_VdAuwera on 2018-02-06
Hi @jingjin0322, if the samples are identified by individual SM tags then they are not considered pooled, and you don’t need to set a ploidy (unless the ploidy of the organism is not 2).
In what formats are the published genomes that you want to include in your analysis?
From jingjin0322 on 2018-02-23
Hi @Geraldine_VdAuwera ,
Thanks for your reply! Really apprecited it!
The published genomes are in fasta format, I did in silico enzyme digestion and converted the digested genome fragments into fastaq format.
Also, there are published SNP data in VCF format available to me too.
Thanks again!
From Sheila on 2018-02-26
@jingjin0322
Hi,
You can use the FASTQ genomes and run those through the Best Practice pipeline like you do for your own samples. You should call variants on both the published genomes and your own samples together, so you can easily compare the variants.
You should not use the already made VCFs, as they could have been produced using a different reference or workflow.
I hope this makes sense.
-Sheila
From jingjin0322 on 2018-02-26
Hi @Sheila ,
Thanks for you reply! Yes, that absolutely makes sense to me. I actually tried that and a new question came up.
The organism I am working on is diploid, however, I believe the published genome is the haplotype (or the allele with higher sequencing frequency was assembled into the genome at heterozygous site) of the organism. I wonder if that will have any impact on my data analysis if I call variants on both the published genomes and my own samples together?
Thanks for your time and help!
From Geraldine_VdAuwera on 2018-02-27
I’m afraid you’re going to have a hard time using the data from the published genomes if they’re only FastAs that you chopped up. Partly because they’re only haplotypes, but mostly because that method effectively gives you only a read depth of 1 at every position, which will not be interpreted correctly by GATK. You should try to find out if there is a short read dataset (ie the original sequence) available for those genomes that you can make GVCFs from. If not, then you cannot include them into the GATK joint calling process. What you would do instead is generate a VCF of variants through direct sequence alignment and use that as a comparison resource in your analysis after having called variants on your samples.
From jingjin0322 on 2018-02-28
Hi @Geraldine_VdAuwera , thank you for the suggestions! I’ll try that.
From RSmithAg on 2018-11-09
Hi, I have a question regarding this I hoped you could help me with.
I am working with an organism which has haploid and diploid stages. I sequenced an unknown number of them pooled. I also cannot tell which of the stages it was in, probably both. I used UnifiedGenotyper and I set -ploidy to 2 (default). My question is if this would impact my results and how? I am trying to obtain alternative allele frequencies and I am not sure to which extent setting -ploidy to 1 or 2 would make a difference in the variants called and in their frequencies.
Thanks.