175. Any ploidy goes

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2014-08-27

Until now, HaplotypeCaller was only capable of calling variants in diploid organisms due to some assumptions made in the underlying algorithms. I'm happy to announce that we now have a generalized version that is capable of handling any ploidy you specify at the command line!

This new feature, which we're calling "omniploidy", is technically still under development, but we think it's mature enough for the more adventurous to try out as a beta test ahead of the next official release. We'd especially love to get some feedback from people who work with non-diploids on a regular basis, so we're hoping that some of you microbiologists and assorted plant scientists will take it out for a spin and let us know how it behaves in your hands.

It's available in the latest nightly builds; just use the -ploidy argument to give it a whirl. If you have any questions or feedback, please post a comment on this article in the forum.

Caveat: the downstream tools involved in the new GVCF-based workflow (GenotypeGVCFs and CombineGVCFs) are not yet capable of handling non-diploid calls correctly -- but we're working on it.

UPDATE:

We have added omniploidy support to the GVCF handling tools, with the following limitations:

When running, you need to indicate the sample ploidy that was used to generate the GVCFs with -ploidy. As usual 2 is the default ploidy.
The system does not support mixed ploidy across samples nor positions. An error message will be thrown if you attempt to genotype GVCFs that have a mixture, or that have some genotype whose ploidy does not match the -ploidy argument.

LATEST UPDATE:

As of GATK version 3.3-0, the GVCF tools are capable of ad-hoc ploidy detection, and can handle mixed ploidies. See the release highlights for details.

Updated on 2014-12-16

From charlesbaudo on 2014-08-29

I’m in the process of comparing variant calls made by UnifiedGenotyper and this nightly build of HaplotypeCaller on 7 samples of a haploid organism.

Here is my UnifiedGenotyper command: `java -Xmx6g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R /path/ref.fasta -I /path/7_sample_merge.bam -glm BOTH -ploidy 7 -o 7_sample.raw.vcf`

and here is my HaplotypeCaller command: `java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R /path/ref.fasta -I /path/sample1.bam -I /path/sample2.bam -I /path/sample3.bam -I /path/sample4.bam -I /path/sample5.bam -I /path/sample6.bam -I /path/sample7.bam -stand_call_conf 30 -stand_emit_conf 10 -ploidy 7 -o 7_samples.raw.vcf`

The terminal printout for UG shows the number of processed active regions , while the HC does not. I see this phenomenon when I have a single BAM input and ploidy specified as 1 and with two BAM inputs and ploidy specified as 2. Disclaimer: I have never used HaplotypeCaller before as I solely work on a haploid organism nor have I attempted to HC on diploid data. Apologies if this is the intended behavior, but I thought I should inquire about it.

UG:

> INFO 09:47:01,777 ProgressMeter – Supercontig_6:765865 3.27e+07 48.5 h 88.8 m 79.7% 60.8 h 12.4 h

> INFO 09:48:01,782 ProgressMeter – Supercontig_6:776249 3.28e+07 48.5 h 88.8 m 79.7% 60.8 h 12.3 h

Nightly HC:

> INFO 09:55:14,091 ProgressMeter – Supercontig_1:5972736 0.0 19.4 h 15250.3 w 14.5% 5.6 d 4.8 d

> INFO 09:56:14,095 ProgressMeter – Supercontig_1:5976722 0.0 19.5 h 15250.3 w 14.5% 5.6 d 4.8 d

From Geraldine_VdAuwera on 2014-08-29

Hi @charlesbaudo,

First, thanks for being willing to give this a try!

The difference in the ProgressMeter behavior that you’re seeing is expected. It’s due to fundamental differences in how UG and HC traverse the data. UG traverses the genome one individual position at a time, so it’s easier to track its progress, and it goes through each unit of data much faster than HC. In contrast, HC traverses the data by ActiveRegions, which are stretches of genome that can go up to several hundred bases (or more if you set the max higher). Each unit takes much longer to process since it contains more data, and so you may not see the count go up for a while (although it should move eventually, unless something is broken that I’m not aware of).

That said I realize I wasn’t clear about the usage of the ploidy argument. If you have 7 samples that are clearly separated with distinct read groups (as opposed to pooled samples which are not identified) you need to set the ploidy value to be the actual ploidy of a single sample, as opposed to the aggregate ploidy over all samples. So here you would set the ploidy to 1 for your haploid samples, otherwise you’re asking HC to consider each sample as a heptaploid. This should speed up the process as HC will have considerably less work to do.

I might also recommend doing a single-sample to single-sample comparison first before diving into multi-sample comparisons, but that’s up to you.

From charlesbaudo on 2014-08-29

Thank you for your quick response, @Geraldine_VdAuwera. To clarify, if I have merged those same 7 BAM files I should still set ploidy to 1 if they have distinct read groups? And this recommendation applies to both HC and UG? Thank you.

From Geraldine_VdAuwera on 2014-08-29

As long as the samples have distinct read groups (specifically, distinct SM tags) it makes no difference whether they are in the same file or in separate files. The GATK engine merges all the data when it reads in the file contents, and distinguishes samples by SM tags. This does apply to both UG and HC, yes.

From charlesbaudo on 2014-08-29

Thank you for the clarification. Cheers.

From Geraldine_VdAuwera on 2014-09-10

Updated the text to reflect added support for GVCF tools, with stated limitations.

From travcollier on 2014-10-14

The docs for HapylotypeCaller (from the nightly build 2014-10-14-gf24cf57) still say that it only supports a ploidy value of 2 under the ‘Caveats’ section.

That should probably get updated too.

From Geraldine_VdAuwera on 2014-10-14

Indeed, thanks for pointing it out.

From Oprah on 2014-12-16

Does GenotypeGVCFs in the latest nightly build have features beyond v3.3-0, with respect to mixed ploidy across samples/positions? Thanks.

From Geraldine_VdAuwera on 2014-12-16

Actually, version 3.3 supports mixed ploidy; this post refers to 3.2. I’ll add a note. Meanwhile, see https://www.broadinstitute.org/gatk/blog?id=4741 for the mixed ploidy announcement.

From sepid on 2015-06-12

May I ask you whether you know any polyploidy (fragments) dataset available publicly?

From Geraldine_VdAuwera on 2015-06-14

I’m afraid not, @sepid, sorry. Maybe SeqAnswers or BioStars would know.

From sepid on 2015-06-17

I have a question about how the algorithm of GATK-Polyploidy & mix-Ploidy work? have you guys published it or have you released the details of your algorithm? I very much appreciate your response!

From Geraldine_VdAuwera on 2015-06-17

It hasn’t been published yet, sorry. But you can view the code [here](https://github.com/broadgsa/gatk-protected) if you want.

From Katie on 2015-09-15

I have a question about specifying ploidy for population sequence data. I am conducting a population sequencing study on a vector-borne bacteria, so each genomic library I have is the bacterial population infecting a single vector, but constitutes a pool of an unknown number (likely 1-7) different bacterial clones. I am planning to use Haplotype Caller to call variants but am wondering how to deal with ploidy when I have an unknown ploidy (or number of coinfecting bacterial strains) present within each genomic library. Should I set the ploidy to 7 (the maximum we likely will see in nature) for all samples and call them all together?

From Geraldine_VdAuwera on 2015-09-16

@Katie There is no absolute answer to your question; it’s going to be a tradeoff between sensitivity (aim to detect every variant, even if present only as a minority fraction) and specificity (aim to reduce false positives resulting from sequencing noise/artifacts). If you want to maximize sensitivity, you’ll need to set the ploidy to the highest likely number of clones in your population.

From Katie on 2015-09-16

Okay, thanks for getting back to me so quickly!

From Katie on 2016-01-09

One more question about setting ploidy for a haploid sample with an unknown number of clones present at unknown proportions (i.e. unlikely they are present at equal proportions). If I set ploidy at 10 because that’s the highest likely number of clones in my population, does the variant detecting algorithm assume that each clone should be present at 10%? In other words, am I then limiting my sensitivity to any variant found in > 10% of the total population? Or instead, is does the ploidy argument limit the number of unique alleles at a locus? HaplotypeCaller is very slow with high ploidy, but I would like to maintain high sensitivity to minority variants.

Thanks for your help!

From Geraldine_VdAuwera on 2016-01-09

@Katie Your first supposition is correct — and yes that would limit your sensitivity to minority variants. Unfortunately the HC was not designed to handle the use case of indeterminate pools.

From Katie on 2016-01-10

Thank you for your help on this. It seems like variant calling may be a two step process:

(1) to identify variants informative of differentiation between pooled samples, set PLOIDY=1 (call consensus allele at each locus) and run HaplotypeCaller on all BAM files which may or may not contain mixed samples.

(2) to identify within-host polymorphisms, I will then set PLOIDY=100 and run HaplotypeCaller on each BAM file individually. (Calling variants on all samples at once with high PLOIDY is computationally infeasible.)

In this case, can I leave the HAPLOTYPE argument at its default value, because this will not have as strong an effect on sensitivity to minority variants?

From Geraldine_VdAuwera on 2016-01-12

That’s an interesting strategy, worth exploring.

Using ploidy 1 seems a bit radical for the first step — this will favor sites where your population is very homogeneous. Sites with lots of heterogeneity may still be called but with low confidence, so you’d have to make sure to lower the emit thresholds. Or use the GVCF mode of HC since it sounds like what you’ll be looking for are sites with low probability of being hom-ref, and potentially multiple alternate alleles. You can then select these out to serve as list of interesting sites.

Assuming you plan to run step 2 on just the sites called in step 1, note that you can do this by passing in your vcf using -L, and add padding (optional but recommended) using -ip 50.?

But ploidy 100 may prove challenging in terms of performance. The number of calculations that need to be done to evaluate all possible combinations is astronomical. In future this may be more feasible if we can triage the combinations that need to be calculated, but with the current model that is not done. I’m afraid your job might run forever… You may have to compromise and use a lower ploidy even for that second step.

From Katie on 2016-01-13

Thank you. Using the GVCF mode makes a lot more sense so that I can separate single sample variant calling and then conduct multi-sample variant discovery to identify different subsets of variants which may inform (1) between sample variation and (2) to identify within-host variants.

However, I ran into trouble using the GVCF mode of HC using the following input:

java -d64 -Xmx20g -jar “/home/bioinfo/software/GATK-3.3/target/GenomeAnalysisTK.jar” \

-T HaplotypeCaller \

-R $REF \

-I $BAM \

—emitRefConfidence GVCF \

-variant_index_type LINEAR \

-variant_index_parameter 128000 \

-nct 24 \

—max_alternate_alleles 3 \

-L $INTERVAL \

—sample_ploidy 30 \

-o ${OUTDIR}${BASE}${INT}”_”${PLOIDY}”.g.vcf”

ERROR MESSAGE: you have hit a current limitation of the GVCF output model that cannot handle ploidies larger than 20 , please let the GATK team about it: 30

When I set the ploidy at 20, I got the error:

A GATK RUNTIME ERROR has occurred (version 3.3-0-geee94ec)

ERROR MESSAGE: 20

I am using GATK version 3.3-0. Is this an issue with the version? Thank you for your help!

From Sheila on 2016-01-13

@Katie

Hi,

It is possible it is an issue with the version. I forgot which exact version ploidy was introduced, but if you use the latest version you should have no problem.

-Sheila

From Katie on 2016-01-13

Great, issue resolved with the latest version. Thank you!

From Katie on 2016-01-14

One more ploidy question: the GenotypeGVCFs tool returns no output when I test it on a list of gVCFs called with a ploidy of 30. I have tested several iterations and get no error messages, but the output file only includes the VCF header. The gVCF files look reasonable.

java -d64 -Xmx20g -jar “/home/ksw9/bin/GenomeAnalysisTK.jar” -T GenotypeGVCFs -R $REF —sample_ploidy 30 -V gVCFTest.list -o test.raw.vcf

From Geraldine_VdAuwera on 2016-01-14

Are you sure the run completed? If yes, try again with a lower emit_conf_threshold. Check your GVCFs to see what is the range of quals of your calls.

Note that you don’t need to specify ploidy to GenotypeGVCFs, it will infer it from the genotypes in your GVCFs.

From Katie on 2016-01-15

Thank you. Yes, the issue is that if I run HaplotypeCaller with PLOIDY=30, all GQ scores are extremely low, due to low confidence in calling 30 alleles. When I set a lower ploidy, GQ scores increase (slightly) and it is possible to use the GenotypeGVCF tool. Even though GQ scores are low, this will help me identify sites that show evidence of within-sample polymorphism.

Thanks for the help!

From Rebs on 2018-02-20

Hi I have a question related to ploidy in UnifiedGenotyper. I am using gatk version 3.8.1 and working with P. falciparum.

I have a file with previously to sequencing pooled samples (they are not identified). The number of samples is known but the number of parasites per sample is unknown. Also, the parasite has stages where it is haploid and others where it is diploid. Which ploidy should I set when calling variants?

Thanks!

From Geraldine_VdAuwera on 2018-02-22

@Rebs Based on what you describe there is no “correct” setting for the ploidy. One possible approach is to choose what is the minimum allele fraction you aim to be able to detect, and set ploidy accordingly.

Alternatively you could try using the Mutect2 somatic caller (in tumor-only mode) which does not make ploidy assumptions.

From Rebs on 2018-02-27

Hi @Geraldine_VdAuwera

Thanks for your reply.

What do you mean with allele fraction?

Also, I performed variant calling trying different ploidys and when setting it to 1, it detects less SNPs…

Regarding Mutect2, I would prefer to stick to UnifiedGenotyper. It was recommended by some colleagues of mine which also work with the same organism (FYI they used ploidy 2 but their conditions were different from mine).

Thanks!

From Geraldine_VdAuwera on 2018-03-02

Imagine you have a pool of 10 diploid individuals and one of them has a heterozygous SNP. Assuming perfect sequence quality, what you expect to see in the data is that the variant allele will be represented by one out of every 20 reads — an allele fraction of 0.05. Considering the regular occurrence of technical artifacts, we’re not going to want to trust an allele that’s represented by very few reads. So let’s say we’re arbitrarily setting that limit at 5 (there are better ways to do this; this is a strawman) — that means you’ll need to have ~100x coverage at this site to make a confident call.

Now in your case you already have the data so you have to work backward from however much coverage you do have. Let’s say it’s ~50x coverage — that means you can only really expect to make confident calls for variants with an observed allele fraction of 0.1, ie one hom-alt in 10 diploid samples or one heterozygote in 5 diploid samples.

Does that make sense?

Re: what you see with ploidy =1, that’s expected. You’re throwing out anything that doesn’t look like a hom-alt call.