howto Call variants with HaplotypeCaller

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2013-06-17

Objective

Call variants on a single genome with the HaplotypeCaller, producing a raw (unfiltered) VCF.

Caveat

This is meant only for single-sample analysis. To analyze multiple samples, see the Best Practices documentation on joint analysis.

Prerequisites

Steps

Determine the basic parameters of the analysis
Call variants in your sequence data

1. Determine the basic parameters of the analysis

If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.

Genotyping mode (--genotyping_mode)

This specifies how we want the program to determine the alternate alleles to use for genotyping. In the default DISCOVERY mode, the program will choose the most likely alleles out of those it sees in the data. In GENOTYPE_GIVEN_ALLELES mode, the program will only use the alleles passed in from a VCF file (using the -alleles argument). This is useful if you just want to determine if a sample has a specific genotype of interest and you are not interested in other alleles.

Emission confidence threshold (-stand_emit_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.

Calling confidence threshold (-stand_call_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.

The terms "called" and "filtered" are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in the GATK's technical language, saying a site was called means that that site passed the confidence threshold test. For filtered, it's even more confusing, because in ordinary language, when people say that sites were filtered, they usually mean that those sites successfully passed a filtering test. However, in the GATK's technical language, the same phrase (saying that sites were filtered) means that those sites failed the filtering test. In effect, it means that those would be filtered out if the filter was used to actually remove low-confidence calls from the callset, instead of just tagging them. In both cases, both usages are valid depending on the point of view of the person who is reporting the results. So it's always important to check what is the context when interpreting results that include these terms.

2. Call variants in your sequence data

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R reference.fa \ -I preprocessed_reads.bam \ -L 20 \ --genotyping_mode DISCOVERY \ -stand_emit_conf 10 \ -stand_call_conf 30 \ -o raw_variants.vcf

Note that -L specifies that we only want to run the command on a subset of the data (here, chromosome 20). This is useful for testing as well as other purposes, as documented here. For example, when running on exome data, we use -L to specify a file containing the list of exome targets corresponding to the capture kit that was used to generate the exome libraries.

Expected Result

This creates a VCF file called raw_variants.vcf, containing all the sites that the HaplotypeCaller evaluated to be potentially variant. Note that this file contains both SNPs and Indels.

Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinctions made by the caller itself between low-confidence calls and the rest is very primitive, and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering), in order to move on to downstream analyses with the highest-quality call set possible.

Updated on 2016-02-11

From BobHarris on 2016-02-09

FYI, for users like myself who are devotees of Murphy and his laws, the first mention of genotyping_mode in this tutorial has a single dash and beware, it is not a real ascii dash. If you unwittingly copy and paste that into your command it won't work.

First you'll discover you need a second dash. But even after you add that you might be misled to think that DISCOVERY isn't a valid setting, as suggested by these error messages:

ERROR MESSAGE: Invalid argument value '–-genotyping_mode' at position 6.

ERROR Invalid argument value 'DISCOVERY' at position 7.

What really confused me was if it didn't like genotypingmode why was it complaining about DISCOVERY; and if it did like genotypingmode, well, I couldn't see what was wrong with DISCOVERY. And why was it telling me it thought '–-genotyping_mode' was an invalid argument value? And what's the difference between an ERROR MESSAGE and a message telling me there's an ERROR?

These tutorials are very helpful, by the way. In spite of the pit- and prat-falls.

From Sheila on 2016-02-10

@BobHarris

Hi,

These tutorials are not meant to be copied and pasted! We want to give you an idea of the arguments you might want to use, but we do expect you to read the tool documentation and understand all of the possible arguments so you can decide which are most appropriate for your analysis. Of course, we understand users might want to copy and paste then blame us for any improper outputs, so we purposely put in hard to copy arguments.

As for the error messages, you are right they are slightly confusing, but in this case, the invalid argument value comes from the program thinking –-genotyping_mode and DISCOVERY are values for the previous argument. The ERROR MESSAGE in the beginning is stating there is indeed an error message and the second ERROR points out what the possible other errors are because there is more than one error here.

-Sheila

From Geraldine_VdAuwera on 2016-02-11

For the record, I think @Sheila is joking — we certainly do not make it difficult to copy/paste commands on purpose. We’ve had a few errors creep in because the text above was in part copy-pasted from a paper manuscript (the 2014 Current Protocols paper). I’ve fixed the errors you pointed out. If you see any others let us know and we’ll fix those too.

Regarding the error messages there is some misunderstanding due to the formatting. There is only one error message (prefaced by ERROR MESSAGE) but it’s reporting several problems, so there’s enough text to lead to multiple lines, which are all prefaced by the word ERROR by convention.

As to the nature of the errors, what you’re seeing here is a domino effect, where one error in an argument name, value or formatting can lead to further errors in subsequent arguments as the parser fails to correctly recognize which is which. The fact that the parser complains that `–-genotyping_mode` was an invalid argument value suggests that something in front of it is wrong. Fun fact: in your post (and possibly your command?) the first dash is one of those fake non-ASCII dashes… If it helps, we run into that problem fairly frequently too and it’s annoying as hell. On the bright side, once you’ve run into this problem you’re a lot more likely to recognize it immediately (and not spend a lot of time tearing your hair out) the next time it happens.

From Sheila on 2016-02-11

@BobHarris

Hi,

Yes, to clarify I was indeed joking about putting in hard to copy arguments to throw off users. We would never do anything like that, at least not on purpose! :smile:

-Sheila

From BobHarris on 2016-02-11

Right, I did know Sheila was joking.

I did in fact think that “Invalid argument value ‘–-genotyping_mode’ at position 6” meant that the parser thought the quoted text was to be a value for some preceding argument. But my command up to that point was “java -jar GenomeAnalysisTK.jar —analysis_type HaplotypeCaller —reference_sequence apple.fa —input_file orange.bam —genotyping_mode DISCOVERY”. (I’ve corrected the hyphen in that). I wasn’t able to identify what argument it might have thought “—genotyping_mode” was a value for, since the previous argument, “—input_file”, had been supplied a value, “orange.bam”. That error message could be improved if it also told me what argument the parser thinks this allegedly bad value belongs to.

The expectation that users won’t cut and paste code or keywords posted in a tutorial is ill-considered. Being myself a lousy typist, I’m (usually) much less likely to screw up an unfamiliar command line if I copy and paste from an example, editing to change the parts specific to me. I did in fact read all of the tutorial text before running my command, and did look through the output of HaplotypeCaller —help to get some brief understanding of each argument before I actually tried running the command.

If, as Geraldine writes, non-ascii dashes are a fairly frequent problem, it would seem useful to fix the problem in the parser. It would save headaches for users and support staff alike.

From Geraldine_VdAuwera on 2016-02-12

OK, just wanted to make sure because humor can be easily lost in translation, and taken at face value that statement could be a bit upsetting :)

I agree with you that tutorial commands should be cut-and-pastable, @BobHarris. The modus operandi of these tutorials is that we tell you to run a specific command line to achieve a specific purpose. Obviously we’re hoping people read the explanation of what the key arguments do, but we don’t expect anyone to customize the command while in the context of the tutorial. The warning against straight-up copy-pasting is meant for other types of documents (such as the tool docs) where the commands given are usually generic examples that we’d rather people not take for granted as applicable to their use case.

The dashes problem is not frequent enough to make me willing to have the parser start accepting other types of characters — I see that as a slippery slope to more problems. And to some extent it’s a fairly innocuous and easy to resolve problem (we recognize it very easily, at least) that introduces people to the idea that characters are not all created equal and that word processors are evil. I know that view might be a bit controversial but it’s how I feel when I have my educator hat on. Now, ask me again when I’m actually trying to run commands for my own purposes. Then there’s swearing and shaking of fists :)

From aneek on 2016-04-04

Hello,

I have a query regarding generating a single sample VCF file from BAM file for exome sequencing. Is it possible to exclude some intronic regions while generating a VCF because those intronic regions we do not want in our analysis.

Suppose if it want keep variants present only CDS +12/-12 positions. Which command should I use at this step or hardfiltering step?

Thanks

From Sheila on 2016-04-04

@aneek

Hi,

I think [this article](https://www.broadinstitute.org/gatk/guide/article?id=4133) should help.

-Sheila

From aneek on 2016-04-05

@ Sheila

Thank you for quick response. I have understood that I have to use an interval padding option with -L option to restrict my exome analysis up to CDS +12/-12 positions. However to do that should I have to use

“-L chr:1-12” option to add padding 12 bp around all exons? Please correct me if I am wrong.

From Geraldine_VdAuwera on 2016-04-05

No, that would mean you want to run on bases 1 to 12 of a contig called chr. The interval padding option is -ip.

From aneek on 2016-04-09

@Geraldine_VdAuwera

Thank you for the correction. So now if I use the following command would serve my purpose of to restrict my exome analysis up to CDS +12/-12 positions.

java -jar GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R reference.fa \ -I preprocessed_reads.bam \ -L file_interval.bed \ -ip 12 —genotyping_mode DISCOVERY \ -stand_emit_conf 10 \ -stand_call_conf 30 \ -o raw_variants.vcf

Please correct me again if I am wrong.

From Geraldine_VdAuwera on 2016-04-09

Yes that’s correct.

From aneek on 2016-04-11

@Geraldine_VdAuwera

Thank you very much for the guidance.

After getting the VCF file, if I want to filter out known variants having greater than 0.01 MAF in 1000genome, ExAC and EVS, from the VCF file itself, before going to the annotation step, is there any tool in GATK to do that?

In other words if there is any tools or option in GATK which can filter the VCF file by removing known variants having greater than 0.01 MAF in 1000genome, ExAC and EVS?

If not is there any other way to do that?

Thank you.

From Sheila on 2016-04-11

@aneek

Hi,

You will need to use [SelectVariants](https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php) to select the sites in the VCFs that have greater than 0.01 MAF. Then, use SelectVariants again to remove the sites from your own VCF.

-Sheila

From aneek on 2016-04-11

@Sheila

Is their any specific example in the tutorial to do that. It is not much clear to me about what are the options should I use in the command line. I want to keep the novel variants along with the known variants having MAF < 0.01 in population in the output VCF. MAF is in terms of POPULATION metric and not the Allele FRACTION here.

Do I need to feed reference data from 1000 genome, EVS and ExAC in the commandline based on which SelectVariants will perform the task. Because what I guess that the program will need some reference with which it will compare MAF of each variant in the input VCF and will generate an output VCF having only those variants which have MAF less than equals to 0.01 along with the other variants which are novel (not reported in any of these databases).

Any help would be highly appreciated. Thank you.

From aneek on 2016-04-12

Hi, any help or suggestions regarding my queries..

From Sheila on 2016-04-12

@aneek

Hi,

I think [this thread](http://gatkforums.broadinstitute.org/gatk/discussion/3761/rare-variants-from-annotated-vcf-file) and [this thread](http://gatkforums.broadinstitute.org/wdl/discussion/3648/gatk-filter-by-minor-allele-frequency) may help.

-Sheila

From aneek on 2016-04-13

@Sheila

Thank you very much for the information. Another thing is how can I restrict the information provided in the INFO column.

Suppose I want to generate a VCF having only “AC=4;AF=1.00;AN=4;DB;DP=76;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.21;SF=0,1;SOR=0.693” in the INFO column excluding the excess information like “BaseQRankSum=1.973;ClippingRankSum=0.932;MQRankSum=0.282;QD=28.65;ReadPosRankSum=-0.065 etc. which it provides for some variants while using “HaplotypeCaller”.

From Sheila on 2016-04-13

@aneek

Hi,

You can use [-sites_only](https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_CommandLineGATK.php#—sites_only).

-Sheila

From aneek on 2016-04-13

@Sheila

Thank you very much. It is written in the link that using this argument will generate a VCF file with first 8 columns including the INFO column (excluding the genotypes). How will it restrict the information as I have mentioned in my earlier query in the INFO column. Shall I have to put any value for this argument.

Suppose if I want to keep only “AC=4;AF=1.00;AN=4” information in the INFO column what value should I use for -sites_only argument?

In other words is there any control in GATK to restrict the amount of information in the INFO column of VCF.

Thanks..

From Sheila on 2016-04-14

@aneek

Hi,

You can use [-XA](https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#—excludeAnnotation) to exclude any annotations you don’t want.

-Sheila

From medhat on 2016-06-25

HI, Is there any way to make it run faster as I am running it “GATK 3.5” on cluster “using one node 28 core 100gb of rams “ and it takes 7 days and it is not even finished “my input two .bam samples each is 109 GB”

I think the solution may be to use the Queue system http://gatkforums.broadinstitute.org/gatk/discussion/1306/overview-of-queue but it is not clear how it will control rams or how it will speed it up

From Sheila on 2016-06-27

@medhat

Hi,

You can try using [WDL](https://software.broadinstitute.org/wdl/) which is replacing Queue and is easier to use than Queue.

-Sheila

From medhat on 2016-06-29

Thanks a lot for the answer, but after reading the documentation there is no clear way how to use it with slurm cluster and to use multi node ?

From KateN on 2016-07-12

Apologies for the delay in the response, I’ve been clarifying a few things to hopefully give you a response to satisfy all your questions.

At this time, WDL does not support a Slurm backend, so I would actually recommend Queue. We do have some presentations about getting started with Queue, which you can find [here](https://www.broadinstitute.org/gatk/documentation/presentations?id=3391).

There is no explicit ram control in Queue itself. When writing the scala script for your Queue pipeline, you can specify the usual JVM memory parameters within the GATK command. At the Slurm level, you can use jobNative arguments (listed in the Advanced Queue presentation linked above) to specify memory requirements in the Queue command.

In addition to that, you can also speed up the runtime using multithreading, which is done with the GATK argument `-nt` or `-nct`. You can read more about how that works [here](http://gatkforums.broadinstitute.org/wdl/discussion/1975/how-can-i-use-parallelism-to-make-gatk-tools-run-faster).

From mglclinical on 2016-07-20

Hi @Geraldine_VdAuwera

The default values for following parameters in HaplotypeCaller and GenotypeGVCFs are :

-stand_emit_conf 30

-stand_call_conf 30

But, for Single Sample DNA-Seq analysis, it seems to me the suggestions are :

-stand_emit_conf 10

-stand_call_conf 30

Please let me know if my understanding is correct regarding the recommended value of 10 for stand_emit_conf parameter especially in single sample DNA-Seq exome analysis.

Thanks,

mglclinical

From mglclinical on 2016-07-25

I am wondering if anyone has any opinion on my question ?

Thanks,

mglclinical

From Geraldine_VdAuwera on 2016-07-25

Hmm, aside from the crickets… frankly it depends more on the level of sensitivity you want to achieve then whether you’re working with a single sample or with a cohort. `-stand_emit_conf 10` will allow you to consider sites where the evidence is very unconvincing, and that would most likely be filtered out in the next step anyway. Keep in mind that those sites will have a status of filtered, so if you want them to be included in subsequent analysis you need to use the appropriate arguments for each tool. If you don’t want to have to deal with that, drop `stand_call_conf` too.

Note also that we’re probably going to get rid of the `stand_emit_conf` argument in the next version because we don’t find it useful any more and it causes a lot of confusion. I think it will be easier to have just one dial to turn to set the sensitivity floor.

From mglclinical on 2016-07-25

@Geraldine_VdAuwera , thank you for your opinion

From LeanStream on 2016-09-23

Hi @Geraldine_VdAuwera ,does the arguement “stand_emit_conf” have been deleted in the up-to-date nightly version？ I meet some error while call variant with HaplotypeCaller: “ERROR MESSAGE: Arguement with name ‘stand_emit_conf’ isn’t defined.”

From Sheila on 2016-09-30

@LeanStream

Hi,

Yes, that argument has been deprecated. It was supposed to be a convenience function to do a sort of “pre-filtering”, but mostly it ended up being confusing and pointless. You can still use stand_call_conf.

-Sheila

From RodenLuo on 2017-01-18

I got this error when I applied the example command. ```

ERROR MESSAGE: Invalid command line: The parameter standardminconfidencethresholdfor_emitting is deprecated. This argument is no longer used in GATK versions 3.7 and newer. Please see the online documentation for the latest usage recommendations.

```

Then I found this post. So, should I just remove the -stand_emit_conf 10?

Also, which value should be a good starting point for This also comes with a lower default value for the -stand_call_conf threshold

Thanks, Roden

From Geraldine_VdAuwera on 2017-01-18

Yes that’s correct. We’ll update the tutorial accordingly. Thanks for reporting this!

From terestahl on 2017-01-24

Hello, Could you please comment which will be a good starting value for -stand_call_conf when performing HaplotypeCaller in RNA-seq data. Thanks

From Geraldine_VdAuwera on 2017-01-24

The RNAseq doc specifies a call threshold of 20.

From abhishek_maj08 on 2017-02-10

@Geraldine_VdAuwera

Hi,

I have 19 samples (WGS). As recommended by best practices, I ran HaplotypeCaller on individual samples to generate the corresponding gVCF file. The problem is, it takes a huge amount of time (~99 hours). Is there a way to shorten this time amount.

Also I was curious, if I called HaplotypeCaller with multiple BAM files (different samples) directly to generate raw variants, will it give me incorrect results?

I mean for exome data, I would definitely first generate gVCFs; but for WGS data, given that it takes so much time, can we directly call HaplotypeCaller for vcf files (in the interest of time saving).

Thanks

Abhishek

From Geraldine_VdAuwera on 2017-02-10

@abhishek_maj08 Calling HC on multiple samples takes a lot of time as well — the computationl resource expenditure increases dramatically. So we parallelize HaplotypeCaller over intervals to accelerate processing. We have a intervals that are appropriate for WGS, if you need it.

From abhishek_maj08 on 2017-02-10

@Geraldine_VdAuwera Yes please. It will be very helpful if you provide the intervals. Also just to clarify, when you are talking about parallelizing HC, are you talking about generating GVCF or the joint analysis?

From Geraldine_VdAuwera on 2017-02-10

We parallelize both over the same intervals. Our workflow for the first part of the pipeline (up to GVCF) is available [here](https://github.com/broadinstitute/wdl/tree/50b4dd09314c2c89177b39f92ab7bcfa280b39e0/scripts/broad_pipelines) if you want to check it out, and we have some additional workflows coming out soon for the rest.

The intervals lists are in our resource bundle — see the Downloads page for details of how to access it.

From abhishek_maj08 on 2017-02-14

@Geraldine_VdAuwera

Thank you for the workflow. I was looking at the wgs_calling_regions.hg38.interval_list and I observed that there are chunks in each chromosome that have been left out. Could you please tell me how these intervals were selected/chosen? I mean why a segment of a chromosome is selected or left out?

From shlee on 2017-02-16

Hi @abhishek_maj08,

If these are the intervals I’m thinking of, I believe they are an intersection of genomic regions surrounded by gaps (NNs) and regions that exclude low complexity satellite regions for which calling would be a waste of compute for most researchers.

From abhishek_maj08 on 2017-03-07

Geraldine_VdAuwera shlee

Thank you for your reply.

Let me tell you what I am trying to do. I am planning to run GATK on WGS data of Rhesus macaque. I want to identify similar interval regions in the rhesus genome as you have done in the human genome for my WGS GATK run. To that end I would like to follow the steps that you have used. Currently I have run RepeatMasker (with default settings) on the rhesus genome, to get a list of masked regions. Now, do I simply take the non-masked regions and use it as my interval list? Or are there intermediate steps which should be done, before desired interval list will be obtained.

Kindly let me know the steps you followed to regenerate the human wgs_calling_regions.hg38.interval_list. That way I can use it as a guideline to get the corresponding rhesus interval list and run my experiments.

From shlee on 2017-03-08

Hi @abhishek_maj08,

We’re not familiar with RepeatMasker so it’s hard for us to comment meaningfully on your exact question. What I can say is that Picard has a tool, [ScatterIntervalsByNs](https://broadinstitute.github.io/picard/command-line-overview.html#ScatterIntervalsByNs), that can help you create intervals between gapped regions. You can also examine the gapped regions in a file that UCSC provides. I talk about it briefly [here](http://gatkforums.broadinstitute.org/gatk/discussion/7857/reference-genome-components) and give the ftp address for the human equivalent. You may find other file types that are useful towards your aims in the same directory.

The tool to use to create an intersection of intervals is Picard’s [IntervalListTools](https://broadinstitute.github.io/picard/command-line-overview.html#IntervalListTools). I hope this is helpful.

From mscjulia on 2017-05-24

Dear GATK team,

I am trying to decide if I should use HaplotypeCaller or MuTect2 for one set of my samples. You mentioned in one document that “… HaplotypeCaller is able to handle non-diploid organisms as well as pooled experiment data. Note however that the algorithms used to calculate variant likelihoods is not well suited to extreme allele frequencies (relative to ploidy) so its use is not recommended for somatic (cancer) variant discovery. For that purpose, use MuTect2 instead.”

So what kind of “extreme allele frequencies” we are talking about here? 2 Ts and 50 As, or 2 Ts and 1000 As? Is there a parameter designed to control it please? Thanks a lot.

From shlee on 2017-05-25

Hi @mscjulia, please check HaplotypeCaller options in [the tool doc](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php) for its various options. I believe one of the parameters you want to tweak for testing out your questions is `—standard_min_confidence_threshold_for_calling`. Also, take a look at [this dictionary entry](http://gatkforums.broadinstitute.org/gatk/discussion/8603/heterozygosity) that relates to `—heterozygosity` and `—heterozygosity_stdev` options. I’ll have to check with a developer for more details.

From mscjulia on 2017-05-25

@shlee Thank you very much. I went over the two documents and have a better understanding now, but still I would like to know a exact cut off like I mentioned previously. Looking forward to hearing from you. Thank you.

From shlee on 2017-05-30

Hi @mscjulia,

Our developer says that GATK4 Mutect2 will be able to handle the low allele fractions you mention (2/50, 2/1000). For how Mutect2 handles the math, we have equations at .

As for exact cutoffs, because these are dependent on base qualities, you will have to test the cutoffs for your own data given your base qualities. I would suggest [simulating reads](http://gatkforums.broadinstitute.org/gatk/discussion/7859/how-to-simulate-reads-using-a-reference-genome-alt-contig) that provide a variant and that reflect your data’s base qualities to the depths you are interested in then calling on these.

I will also keep your question in mind for future documentation purposes as an illustrative example.

From mscjulia on 2017-05-30

@shlee Thank you very much! I appreciate all the help. One more question, I know that MuTect 2 doe not assume diploid, but if we have reasons to believe our samples are mostly diploid, would it actually suits MuTect2? Thanks!

From shlee on 2017-05-31

@mscjulia MuTect2 can handle sliding ploidies. My advice is to test out both tools for the data and question you need to answer.

From ShawnWilliams on 2017-08-23

Dear GATK team: I use haplotypecaller to call variants on my data. I ran it twice, and it crush at the same place, I wonder is it my data that cause the problem? Or maybe other reasons? and the log is: WARN 16:01:59,604 HaplotypeCallerGenotypingEngine - location X:101409254-101409259: too many alternative alleles found (7) larger than the maximum requested with -maxAltAlleles (6), the following will be dropped: TTTTTT. INFO 16:02:05,819 ProgressMeter - X:103762026 2.881033286E9 2.5 h 3.0 s 95.1% 2.6 h 7.7 m

ERROR --

ERROR stack trace

java.util.NoSuchElementException at java.util.HashMap$HashIterator.nextNode(HashMap.java:1439) at java.util.HashMap$KeyIterator.next(HashMap.java:1461) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.reduceNumberOfAlternativeAllelesBasedOnLikelihoods(HaplotypeCallerGenotypingEngine.java:336) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:264) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:964) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:251) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)

ERROR ------------------------------------------------------------------------------------------

ERROR A GATK RUNTIME ERROR has occurred (version exported):

ERROR

ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.

ERROR If not, please post the error message, with stack trace, to the GATK forum.

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions https://www.broadinstitute.org/gatk

ERROR

ERROR MESSAGE: Code exception (see stack trace for error itself)

ERROR ------------------------------------------------------------------------------------------

From Sheila on 2017-08-23

@ShawnWilliams

Hi,

This looks like a bug that has been fixed in the latest version. Are you using the latest version of GATK?

-Sheila

From ShawnWilliams on 2017-09-09

@Sheila

Hi，

I use gatk3.6. I have solved this problem, It’s my data’s problem. My data has some read which read Length is 0, and I use —filter_bases_not_stored in BQSR pipeline, then haplotypeCaller can run normally.

From Frank_cumc on 2017-09-20

Hello, I am also experiencing this problem. When I run the ValidateSamFile command on the bam file after base recalibration, it says no errors found. This is what happens when I try to run HaplotypeCaller with -forceActive and -disableOptimizations but it works when I remove these parameters. Thank you for the help

ERROR --

ERROR stack trace

java.lang.IllegalArgumentException: READMAXLENGTH must be > 0 but got 0 at org.broadinstitute.gatk.utils.pairhmm.PairHMM.initialize(PairHMM.java:105) at org.broadinstitute.gatk.utils.pairhmm.N2MemoryPairHMM.initialize(N2MemoryPairHMM.java:60) at org.broadinstitute.gatk.utils.pairhmm.LoglessPairHMM.initialize(LoglessPairHMM.java:66) at org.broadinstitute.gatk.utils.pairhmm.PairHMM.initialize(PairHMM.java:138) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.PairHMMLikelihoodCalculationEngine.initializePairHMM(PairHMMLikelihoodCalculationEngine.java:276) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.PairHMMLikelihoodCalculationEngine.computeReadLikelihoods(PairHMMLikelihoodCalculationEngine.java:291) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:950) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:252) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:323) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

ERROR ------------------------------------------------------------------------------------------

ERROR A GATK RUNTIME ERROR has occurred (version 3.8-0-ge9d806836):

ERROR

ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.

ERROR If not, please post the error message, with stack trace, to the GATK forum.

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions https://software.broadinstitute.org/gatk

ERROR

ERROR MESSAGE: READMAXLENGTH must be > 0 but got 0

ERROR ------------------------------------------------------------------------------------------

From Sheila on 2017-09-25

@Frank_cumc

Hi,

This might be a bug, but before I ask you to submit a bug report, can you check with GATK4?

Thanks,

Sheila

From Jaqx008 on 2018-01-13

Hello Guys. I am trying to perform variant (only SNPs) calling on whole genome with the GATK tool. I was told to instal picard, gatk and other things. But I was unable to install the GATKtool. I was told also to just point the command towards the folder containing the gatk but upon running the following command, I get this error. Please let me know if something is wrong with my command and if this can help me generate snps. Btw, I am a bioinformatics beginner.

And lastly, do I need to install the haplotype caller as well? Help!

java -jar $gatk-4.0.0.0/gatk -T HaplotypeCAller -R /Volumes/500GB/pavelAssemblyBowtieIndex/DfarGenome.fa -I /Volumes/500GB/pavelAssemblyBowtieIndex/dustMapping.sorted.bam -o SNPs.vcf

Error: Unable to access jarfile HaplotypeCAller

From Sheila on 2018-01-17

@Jaqx008

Hi,

I think the [quick start guide](https://software.broadinstitute.org/gatk/documentation/quickstart) will help you.

-Sheila

Report abuse