created by Geraldine_VdAuwera
on 2014-03-06
Okay, we realize the name’s a bit of a mouthful, and we’re willing to tweak it if anyone has any good ideas. But never mind that. It’s difficult to overstate the importance of this new approach to joint variant discovery (but I’ll give it a shot) so we’re really stoked to finally be able to share the details of how it’s going to work in practice.
You’re probably going to be surprised at how simple it is in practice (not that it was particularly easy to actually build, mind you). The gory details are in [the new document here](http://www.broadinstitute.org/gatk/guide/article?id=3893), but here’s an overview of how it looks within the Best Practices workflow you all know and (hopefully) love:
——
The first surprise is that instead of calling variants on multiple samples, you now just run HaplotypeCaller on each sample individually. “Oh no,” I hear you cry, “but the results were so much better when I called multiple samples together!” Well yeah, but it took forever. Bear with me for a minute.
The key here is that you run HaplotypeCaller in gVCF mode. This outputs a so-called [genomic VCF](http://www.broadinstitute.org/gatk/guide/article?id=2940), which contains a record of the genotype likelihoods and annotations for every single site in the genome (or exome), whether or not there is evidence of variation. This essentially boils down all the useful information that can be gleaned from the BAM files, and makes it unnecessary to refer back to the BAM in later steps.
So you repeat that for all your samples (which goes reasonably fast since per-sample calling is pretty tractable nowadays). Optionally, you can add in a step to combine gVCF files if you’re working on a really large cohort. Then in the next step, you just run a separate genotyping tool on all the gVCFs (or combined gVCFs) together, which gives you the same output (raw SNP and indel calls) that you would have got from one-step multisample calling.
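To make the shape of the workflow concrete, here is a minimal command-line sketch (GATK 3.x syntax; the jar path, reference, sample names and file names are placeholders you would swap for your own):

```shell
# Step 1: per-sample calling in GVCF mode (repeat for each sample)
java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R reference.fasta -I sampleA.bam \
    --emitRefConfidence GVCF \
    --variant_index_type LINEAR --variant_index_parameter 128000 \
    -o sampleA.g.vcf

# Step 2: joint genotyping across all the per-sample gVCFs
java -Xmx4g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R reference.fasta \
    -V sampleA.g.vcf -V sampleB.g.vcf -V sampleC.g.vcf \
    -o joint_calls.vcf
```

Step 1 is the only part that touches the BAMs, so adding a new sample later means one new HaplotypeCaller run plus a re-run of the cheap genotyping step.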
See, that’s the beauty of the new workflow. A lot less work (for the computer) for equivalent results. And the ability to process samples incrementally and perform joint discovery on cohort sizes that would have previously got you hauled off to the funny farm.
Let us know what you think!
Updated on 2014-03-07
From mgymrek on 2014-03-07
The link to the “gory details” just goes to this page. Where can I find these gory details?
From Geraldine_VdAuwera on 2014-03-07
Link fixed.
From Bettina_Harr on 2014-03-23
Hi Geraldine,
I have started this program on my bam file containing all chromosomes (mouse). It managed to call genotypes for 3 chromosomes in 5 days. I cannot run programs for longer than that due to limits in wall time. Do you recommend running each chromosome separately, maybe? Or is there any other way to speed up the calculations?
Thanks and best,
Bettina
From mike_boursnell on 2014-03-24
Hi Geraldine,
This new method seems terribly slow. I am running a batch of BAM files of about 2.5 GB each, over only a 14 Mb region of the genome, and each one is taking about 12 hours. I am using -nct 1 because I can’t risk using higher values, as that was unreliable in the past.
Am I doing something wrong?
java -Xmx4g -Djava.io.tmpdir=javatempdir -jar GenomeAnalysisTK.jar -R canfam3.fasta -T HaplotypeCaller --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -L chr13:38900000-53300000 -I AS_case_07_final.bam -o gVCF1_AS_case_07_final_variants.vcf -S STRICT -nct 1
From Geraldine_VdAuwera on 2014-03-24
@Bettina_Harr and @mike_boursnell That’s odd, as in our hands this goes much faster than the previous versions. Are you working with very deep coverage? You can certainly parallelize by running on each chromosome individually, to help cut down on runtimes.
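For the scatter side, a simple per-chromosome loop would look something like this (a sketch only; the chromosome list here matches the mouse reference, and the jar path and file names are placeholders — on a real cluster you would submit each iteration as a separate job instead of backgrounding it):

```shell
# One HaplotypeCaller run per chromosome; each writes its own gVCF.
for CHR in $(seq 1 19) X Y MT; do
    java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
        -R reference.fasta -I sample.bam \
        -L ${CHR} \
        --emitRefConfidence GVCF \
        --variant_index_type LINEAR --variant_index_parameter 128000 \
        -o sample.chr${CHR}.g.vcf &
done
wait
```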
From Bettina_Harr on 2014-03-24
Hi Geraldine, no my coverage is not extremely high, average 20-fold per site.
From mike_boursnell on 2014-03-24
Quite high coverage, 300-400x. What is the 128000 parameter?
From Geraldine_VdAuwera on 2014-03-24
The 128000 parameter has to do with the index compression scheme — I think it specifies block size.
Let me call in some of the devs who have been working on the HC’s performance. Are you running 3.0 or 3.1?
From Bettina_Harr on 2014-03-24
Hi Geraldine, ok, I have restarted the program processing one chromosome at a time. Looking at my intermediate output, I noticed that the software issued a warning:
WARN 02:33:07,773 DiploidExactAFCalc - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 1:65884356 has 8 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument
I cannot see how a single diploid individual could have 6 or even more alleles. Is this maybe due to some simple sequence repeats/indels? What is the rationale behind allowing 6 different alleles at a site?
Thanks!
From Bettina_Harr on 2014-03-24
oh, I am running 2.8.1. Did not know there was already a new version. Could this be part of the reason? I have done all my pre-processing now with 2.8.1. Can I switch to 3.1 with the HaplotypeCaller? Or should I stay with 2.8.1?
Thanks!
From Geraldine_VdAuwera on 2014-03-24
> I cannot see how a single diploid individual could have 6 or even more alleles. Is this maybe due to some simple sequence repeats/indels? What is the rationale behind allowing 6 different alleles at a site?
Yes, this is typically due to indels in a context with homopolymers and/or repeats. You can limit the number of max alt alleles if you want, which will make the program run faster.
> I am running 2.8.1.
You should definitely switch to 3.1; this new method is specifically designed for work in 3.1, and involves new code that is not in 2.8.
From Bettina_Harr on 2014-03-24
ok, but I do not need to repeat the pre-processing, i.e. indel realignment and base recalibration, when switching to 3.1 now, correct?
From Carneiro on 2014-03-24
Correct, @Bettina_Harr, you don’t need to reprocess (BQSR & indel realignment) to switch. Just run HaplotypeCaller. I actually didn’t even know that we made the reference model available in 2.8; it was very experimental at that point. Try 3.1 and let us know how it goes.
From Geraldine_VdAuwera on 2014-03-24
@mike_boursnell To clarify, are you finding that running HC in GVCF mode is much slower than previously?
From mike_boursnell on 2014-03-26
Yes. That’s right.
From Geraldine_VdAuwera on 2014-03-26
@mike_boursnell Do you have any comparison data, e.g. runtimes for running two different commands on the same data? We’d like to get a sense of what the difference is. The HC does do quite a bit more work in GVCF mode, so it’s expected that the upfront runtime will be a little more, but not orders of magnitude.
From mike_boursnell on 2014-03-26
OK, I’ll have a look. I’m not sure how to compensate for other things running on our server at the same time. Any suggestions?
From Geraldine_VdAuwera on 2014-03-27
Not really — if you were able to reserve a set of nodes that would be ideal of course. Not knowing your setup/access I can’t really advise. We have dedicated machines for testing this kind of thing so it’s a lot easier.
From sergiomv on 2014-04-15
Hi Geraldine,
I ran GATK 3.1 UnifiedGenotyper (UG) and HaplotypeCaller (HC) on whole-genome NA12878. With UG I get a vcf file with all variants called (nearly 4 million), but when I run HC I obtain far fewer variants (nearly 1.3 million), and in the HC log I see this warning many times: WARN 05:26:17,926 ExactAFCalc - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at chr1:7458214 has 7 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument. My BWA alignment is OK; I see only two alternate alleles (diploid). Do you know the reason for this warning? Thanks.
From Geraldine_VdAuwera on 2014-04-15
Hi @sergiomv,
First, I would say that the different numbers of variants are most probably unrelated to the warnings you’re seeing. I can’t really comment on those numbers; you should examine the overlap and the major differences between the two callsets. Have a look at the distribution of annotation values. See if you are missing major blocks of variants, or if there is a correlation with annotation values, e.g. if the variants HC is missing all have low MQ in the UG callset. HaplotypeCaller is more stringent about things like MQ than UG.
Regarding the alleles question, I would bet that HC is seeing multiple possible deletion alleles in your data, because of the low complexity of the context. Keep in mind that the HC does a de novo reassembly step with the data so it may come to a different alignment than BWA. And since considering multiple alleles decreases performance, the HC is set to only proceed with the most likely alleles and discard the rest. You don’t need to worry about this.
From Bettina_Harr on 2014-04-22
Hi Geraldine, a couple of questions:
1) I ran the Haplotype Caller and in the end I am getting a short summary of my reads.
INFO 05:16:33,184 ProgressMeter - Total runtime 30549.58 secs, 509.16 min, 8.49 hours
INFO 05:16:33,185 MicroScheduler - 5481436 reads were filtered out during the traversal out of approximately 21788872 total reads (25.16%)
INFO 05:16:33,186 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO 05:16:33,186 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 05:16:33,187 MicroScheduler - -> 5434226 reads (24.94% of total) failing HCMappingQualityFilter
INFO 05:16:33,187 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 05:16:33,188 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 05:16:33,188 MicroScheduler - -> 47210 reads (0.22% of total) failing NotPrimaryAlignmentFilter
INFO 05:16:33,188 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
I am wondering about those 25% reads that were “filtered” during traversal. Could you tell me if I have to be concerned about this, and what that actually means?
From Bettina_Harr on 2014-04-22
2) I am still struggling with hitting the wall time when I try to run HaplotypeCaller on the “large” mouse chromosomes, i.e. chr1. It takes more than 4 days. We have a large computing cluster; I tried to start Java with 16 processors and 20g RAM, but the program would not run any faster. I also set the max alternate alleles to 2 hoping it would save some time, but it does not. I attach my jobscript below. Is there anything else I could try to speed things up? I am now using version 3.1.1.
Thanks in advance!
#!/bin/sh
#BSUB -q fat-long
#BSUB -R scratch
#BSUB -W 120:00
#BSUB -o TP7-10F1A2.recal.bamX_out
#BSUB -e TP7-10F1A2.recal.bamX_error
#BSUB -R 'span[hosts=1]'
#BSUB -n 16
java -Xmx20g -jar /usr/product/bioinfo/GATK/3.1.1/GenomeAnalysisTK.jar -T HaplotypeCaller -R Mus_musculus.GRCm38.74.dna.chromosome.fa -I TP7-10F1A2.recal.bam --max_alternate_alleles 2 --emitRefConfidence GVCF -L X --variant_index_type LINEAR --variant_index_parameter 128000 -o TP7-10F1A2.recal.bamX.gVCF
From Geraldine_VdAuwera on 2014-04-22
Hi @Bettina_Harr,
1) HaplotypeCaller applies more stringent quality filters than our other tools. Any reads that have a mapping quality less than 20 (if I remember correctly) will be ignored. In your case it looks like many reads have a mapping quality that’s considered too low to be useful, unfortunately. This depends mainly on the aligner you used and the quality of your data.
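If you want to check this outside of GATK, samtools can count how many reads clear the same threshold (a sketch; the BAM name is a placeholder, and -q 20 matches the HC’s MAPQ cutoff):

```shell
# Reads at or above MAPQ 20 (i.e. what the HC will actually use)
samtools view -c -q 20 sample.bam

# All reads, for comparison
samtools view -c sample.bam
```

Comparing the two counts gives you the same fraction the MicroScheduler summary reports for HCMappingQualityFilter.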
2) You can try using a more aggressive pruning setting (see [here](http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_haplotypecaller_HaplotypeCaller.html#—minPruning)) and/or adding a layer of multithreading if your cluster allows it (see [here](http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_CommandLineGATK.html#-nct)). In addition, if this is exome data, you can provide a list of capture targets using the -L argument.
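As a sketch of what those options look like on the command line (the values and file names here are illustrative, not recommendations):

```shell
# --minPruning raises the threshold for pruning low-weight paths out of the
# assembly graph; -nct adds CPU threads within one GATK process.
java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R reference.fasta -I sample.bam \
    --emitRefConfidence GVCF \
    --minPruning 3 -nct 4 \
    -o sample.g.vcf
```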
From Bettina_Harr on 2014-04-22
thank you Geraldine. One more short question: does it make sense to set anything for the -Xmx parameter?
From Bettina_Harr on 2014-04-22
Hi Geraldine, fyi, the -nt option does not seem to be supported in HaplotypeCaller. I am testing the -nct option now.
INFO 01:45:01,617 HelpFormatter - --------------------------------------------------------------------------------
INFO 01:45:01,620 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:21
INFO 01:45:01,621 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 01:45:01,621 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 01:45:01,626 HelpFormatter - Program Args: -T HaplotypeCaller -nt 2 -nct 4 -R Mus_musculus.GRCm38.74.dna.chromosome.fa -I TP7-10F1A2.recal.bam --max_alternate_alleles 2 --emitRefConfidence GVCF -L X --variant_index_type LINEAR --variant_index_parameter 128000 -o TP7-10F1A2.recal.bamX.gVCF
INFO 01:45:01,630 HelpFormatter - Executing as bharr@gwda036 on Linux 2.6.32-431.1.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18.
INFO 01:45:01,631 HelpFormatter - Date/Time: 2014/04/23 01:45:01
INFO 01:45:01,631 HelpFormatter - --------------------------------------------------------------------------------
INFO 01:45:01,632 HelpFormatter - --------------------------------------------------------------------------------
INFO 01:45:02,319 GenomeAnalysisEngine - Strictness is SILENT
INFO 01:45:02,461 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 01:45:02,471 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 01:45:02,522 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.05
INFO 01:45:02,595 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 01:45:02,604 IntervalUtils - Processing 171031299 bp from intervals
INFO 01:45:02,617 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 4 CPU thread(s) for each of 2 data thread(s), of 64 processors available on this machine
ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Invalid command line: Argument nt has a bad value: The analysis HaplotypeCaller currently does not support parallel execution with nt. Please run your analysis without the nt option.
ERROR ------------------------------------------------------------------------------------------
From Geraldine_VdAuwera on 2014-04-23
@Bettina_Harr If you’re using -nct (which is what I meant to recommend, sorry if that wasn’t clear) you may want to increase the memory allocation, yes. See the FAQs on multithreading for some general guidelines.
From Bettina_Harr on 2014-04-24
Unfortunately, I am not very experienced with parallelism. I ran the code overnight WITHOUT a special -Xmx parameter (I did not know what a reasonable value would be in multi-processing mode). It runs fine for about 5 h and then exits with exit code 1. The error file contains this at the end. Do you know what the problem could be? Did it simply run out of memory? I used -nct 6 and reserved 10 processors in my jobscript, to be on the safe side.
INFO 01:42:29,369 ProgressMeter - 1:85086078 0.00e+00 7.4 h 15250.3 w 43.5% 17.0 h 9.6 h
INFO 01:43:29,371 ProgressMeter - 1:85087145 0.00e+00 7.4 h 15250.3 w 43.5% 17.0 h 9.6 h
ERROR ------------------------------------------------------------------------------------------
ERROR stack trace
java.lang.NullPointerException
	at org.broadinstitute.sting.gatk.walkers.haplotypecaller.PairHMMLikelihoodCalculationEngine.computeDiploidHaplotypeLikelihoods(PairHMMLikelihoodCalculationEngine.java:443)
	at org.broadinstitute.sting.gatk.walkers.haplotypecaller.PairHMMLikelihoodCalculationEngine.computeDiploidHaplotypeLikelihoods(PairHMMLikelihoodCalculationEngine.java:417)
	at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.calculateGLsForThisEvent(GenotypingEngine.java:385)
	at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.assignGenotypeLikelihoods(GenotypingEngine.java:222)
	at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:880)
	at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:141)
	at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
	at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
	at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.util.concurrent.FutureTask.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------
From Geraldine_VdAuwera on 2014-04-24
This looks like it might be a bug, actually. If you can isolate the region in which this issue occurred and produce a test case, we’ll have a look and try to fix it. Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894
From Bettina_Harr on 2014-04-24
Hi Geraldine, ok, I performed the following test: the program generated the error after position 85299925 on chromosome 1 (see error below). Therefore I re-ran it with the -L option (-L 1:85298925-85309925):
java -jar /usr/product/bioinfo/GATK/3.1.1/GenomeAnalysisTK.jar -T HaplotypeCaller -L 1:85298925-85309925 -nct 6 -R Mus_musculus.GRCm38.74.dna.chromosome.fa -I 16B.recal.bam --max_alternate_alleles 2 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o 16B.recal.bam1.gVCF1
This generated no error and a normal looking gVCF file.
I think it is not a bug in itself, as the error only arises with chromosome 1 files, and not with any other chromosome of the same individual mouse.
I have now rerun the chr1 files with the -Xmx option set to 20g, reserving 20 processors for it (each processor has 4 GB of memory allocated to it). The run is still waiting in the queue, so I will only be able to say what happened tomorrow morning.
When I start the program interactively, and then use "top" to monitor the memory I can see it uses 18.6 Gb. Maybe -Xmx20g is still not enough?
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23772 bharr 20 0 18.6g 842m 11m S 781.1 1.3 0:16.51 java
error file:
INFO 22:32:03,103 ProgressMeter - 1:85296437 0.00e+00 4.2 h 15250.3 w 43.6% 9.7 h 5.5 h
INFO 22:32:33,105 ProgressMeter - 1:85299925 0.00e+00 4.2 h 15250.3 w 43.6% 9.7 h 5.5 h
WARN 22:32:33,427 ExactAFCalc - this tool is currently set to genotype at most 2 alternate alleles in a given context, but the context at 1:85287439 has 4 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument
ERROR ------------------------------------------------------------------------------------------
ERROR stack trace
java.util.ConcurrentModificationException
	at java.util.LinkedHashMap$LinkedHashIterator.nextEntry(Unknown Source)
	at java.util.LinkedHashMap$EntryIterator.next(Unknown Source)
	at java.util.LinkedHashMap$EntryIterator.next(Unknown Source)
	at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.addMiscellaneousAllele(GenotypingEngine.java:257)
	at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.assignGenotypeLikelihoods(GenotypingEngine.java:227)
	at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:880)
	at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:141)
	at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
	at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
	at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.util.concurrent.FutureTask.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
From Geraldine_VdAuwera on 2014-04-24
Ack, this is worse than I thought. “ConcurrentModificationException” means this is most likely a problem with multithreading; it seems several threads may be trying to modify the same information at the same time, which is bad and causes the crash you see. This might be triggered by something in your data. At this point I would recommend not using multithreading at all with HaplotypeCaller. Instead of multithreading you may want to run HaplotypeCaller individually per chromosome, then concatenate the resulting variants (e.g. using the CatVariants utility). This will enable you to still use your cluster for parallelism without running into this thread safety issue.
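The gather step can be done with the CatVariants utility bundled in the GATK jar (a sketch; note that CatVariants is invoked via -cp rather than -T, the class path shown matches the sting-era package naming seen in the 3.1 stack traces above and may differ in other versions, and file names are placeholders):

```shell
# Concatenate per-chromosome VCFs back into one file
# (inputs must be listed in reference order).
java -cp GenomeAnalysisTK.jar org.broadinstitute.sting.tools.CatVariants \
    -R reference.fasta \
    -V sample.chr1.vcf -V sample.chr2.vcf -V sample.chr3.vcf \
    -out sample.merged.vcf \
    -assumeSorted
```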
From Kurt on 2014-04-24
I've been wondering about that. I've been using both -nct and -pairHMM VECTOR blah when running in gvcf mode and I have been experiencing random crashes that aren't reproducible (I run it again and it goes through fine) on human exomes. About 1 or 2 crashes per 100 runs so far. My stack trace is not the same as above...I haven't had time to look at all of them, but below is my latest crash that I've already rerun through...I'm actually going to do a sanity check through my last 150 to see if there were any that crashed that I missed.
ERROR stack trace
java.lang.NullPointerException
	at net.sf.samtools.SAMRecordCoordinateComparator.compare(SAMRecordCoordinateComparator.java:51)
	at net.sf.samtools.SAMRecordCoordinateComparator.compare(SAMRecordCoordinateComparator.java:41)
	at java.util.TimSort.binarySort(TimSort.java:265)
	at java.util.TimSort.sort(TimSort.java:208)
	at java.util.TimSort.sort(TimSort.java:173)
	at java.util.Arrays.sort(Arrays.java:659)
	at java.util.Collections.sort(Collections.java:217)
	at org.broadinstitute.sting.utils.sam.ReadUtils.sortReadsByCoordinate(ReadUtils.java:320)
	at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.finalizeActiveRegion(HaplotypeCaller.java:1111)
	at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.referenceModelForNoVariation(HaplotypeCaller.java:1001)
	at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:808)
	at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:141)
	at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
	at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
	at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:724)
ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------
From Bettina_Harr on 2014-04-24
Hi Geraldine, I am already running everything on a per-chromosome basis, and this is where I hit my runtime limits with chromosome 1. It takes more than four days, even when allowing 20g when I start Java. It seems I may have to subdivide chromosome 1 into chunks to be able to do it :-(!
From Geraldine_VdAuwera on 2014-04-24
Wait – you’re saying it’s taking four days per chromosome per sample? What kind of coverage do you have?
From Bettina_Harr on 2014-04-24
medium coverage ~20 fold.
From Bettina_Harr on 2014-04-24
these are my CPU times (in seconds):
TP7-10F1A2.recal.bamMT_out: CPU time : 73
TP7-10F1A2.recal.bamX_out: CPU time : 26442
TP7-10F1A2.recal.bamY_out: CPU time : 42362
TP7-10F1A2.recal.bam19_out: CPU time : 45158
TP7-10F1A2.recal.bam15_out: CPU time : 70149
TP7-10F1A2.recal.bam18_out: CPU time : 91582
TP7-10F1A2.recal.bam16_out: CPU time : 113535
TP7-10F1A2.recal.bam14_out: CPU time : 116572
TP7-10F1A2.recal.bam12_out: CPU time : 123313
TP7-10F1A2.recal.bam17_out: CPU time : 132176
TP7-10F1A2.recal.bam3_out: CPU time : 133359
TP7-10F1A2.recal.bam7_out: CPU time : 143940
TP7-10F1A2.recal.bam8_out: CPU time : 145699
TP7-10F1A2.recal.bam6_out: CPU time : 148685
TP7-10F1A2.recal.bam11_out: CPU time : 150079
TP7-10F1A2.recal.bam9_out: CPU time : 153481
TP7-10F1A2.recal.bam10_out: CPU time : 160582
TP7-10F1A2.recal.bam13_out: CPU time : 163721
TP7-10F1A2.recal.bam5_out: CPU time : 196669
TP7-10F1A2.recal.bam2_out: CPU time : 253768
TP7-10F1A2.recal.bam4_out: CPU time : 258922
TP7-10F1A2.recal.bam1_out: CPU time : 355144
From Geraldine_VdAuwera on 2014-04-25
That seems awfully slow. Unfortunately I can’t really help you deal with performance and platform issues as that is beyond the scope of support I can provide, sorry.
From Bettina_Harr on 2014-04-26
Hi Geraldine,
ok, I got it to work. With this configuration, the whole of chromosome 1 takes a little more than 1 day. So there is no bug in using the -nct option after all; one just seems to need a lot of memory.
#!/bin/sh
#BSUB -q fat-long
#BSUB -R scratch
#BSUB -W 120:00
#BSUB -o 14.recal.bam1_out
#BSUB -e 14.recal.bam1_error
#BSUB -R 'span[hosts=1]'
#BSUB -n 20
java -Xmx40g -jar /usr/product/bioinfo/GATK/3.1.1/GenomeAnalysisTK.jar -T HaplotypeCaller -nct 6 -R Mus_musculus.GRCm38.74.dna.chromosome.fa -I 14.recal.bam --max_alternate_alleles 2 --emitRefConfidence GVCF -L 1 --variant_index_type LINEAR --variant_index_parameter 128000 -o 14.recal.bam1.gVCF
From mike_boursnell on 2014-06-27
Hi. Can we use this gVCF method to analyse a bunch of BAM files all from the same sample (just so they can be analysed in parallel?) If so how would you have to set up the Read Group information in these BAM files?
From Geraldine_VdAuwera on 2014-06-28
@mike_boursnell I’m not sure what you mean, but if you’re saying you want to compare the BAMs as if they belonged to different samples (e.g. to compare results on technical replicates), then all you need to do is give them different SM tags so that GATK treats them separately when you do the GenotypeGVCFs step.
From mike_boursnell on 2014-06-30
This is for whole genome sequencing of a single sample. We are looking at splitting the single FASTQ file into (say) 10 smaller bits and then processing them in parallel. When it gets to the HaplotypeCaller gVCFs stage there will be 10 separate BAM files, but which are all actually the same sample. If we give them the same SM tag will GATK assume they are the same sample and produce a single-column VCF file?
From Geraldine_VdAuwera on 2014-06-30
@mike_boursnell Oh OK, got it. Yes, you should be able to just pass in the separate bam files. As long as they have the same SM tag, the data will get re-aggregated by the engine and served appropriately to the HC, leading to a single-sample VCF file.
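In practice that means giving each chunk a distinct read group ID but the same SM value, e.g. at alignment time with bwa, then handing all the chunk BAMs to HaplotypeCaller at once (a sketch; sample, library and file names are placeholders):

```shell
# Same sample, different chunks: unique ID per chunk, identical SM
bwa mem -R '@RG\tID:chunk1\tSM:mysample\tLB:lib1\tPL:ILLUMINA' \
    reference.fasta chunk1_R1.fastq chunk1_R2.fastq > chunk1.sam
bwa mem -R '@RG\tID:chunk2\tSM:mysample\tLB:lib1\tPL:ILLUMINA' \
    reference.fasta chunk2_R1.fastq chunk2_R2.fastq > chunk2.sam

# After sorting/indexing, pass all the BAMs together; the engine
# aggregates them into one sample based on the shared SM tag.
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R reference.fasta \
    -I chunk1.bam -I chunk2.bam \
    --emitRefConfidence GVCF \
    -o mysample.g.vcf
```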
From Bettina_Harr on 2014-07-02
Hi Geraldine,
I have now completed the variant calling using HaplotypeCaller and recalibrated my variants. It all worked, but I am now in a position where my output files contain ONLY variable sites, and not the invariable ones. The vcf files that GenotypeGVCFs produces no longer contain the invariable sites. I would like to have the info on all sites in the final vcf file, even for those sites that are not variable, as I’d like to see their coverage. Is this possible at all at this point?
Thanks in advance!
From Geraldine_VdAuwera on 2014-07-02
Hi @Bettina_Harr,
Yes, you can get that using the `-inv` argument with GenotypeGVCFs. However, in the current version this will cause no-call site records to be emitted, so the hom-ref sites won’t have much information attached to them. I believe the next version will produce detailed records at all sites.
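That is, adding the flag to the GenotypeGVCFs call (a sketch; file names are placeholders):

```shell
# -inv asks GenotypeGVCFs to also emit records at non-variant sites
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R reference.fasta \
    -V sample1.g.vcf -V sample2.g.vcf \
    -inv \
    -o all_sites.vcf
```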
From Bettina_Harr on 2014-07-05
thank you Geraldine! I have another question. The R script does not automatically run on my machine, so I was trying to run the script manually by typing:
library(ggplot2)
> source("7.recalibrate_SNP_plots.R")
Error: ‘opts’ is deprecated. Use ‘theme’ instead. (Defunct; last used in version 0.9.1)
I am getting the error above.
I am using ggplot2 1.0.0.
R version 3.1.0 (2014-04-10) — “Spring Dance“
Copyright © 2014 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.1.0 (64-bit)
Simply replacing “opts” with “theme” does not do the trick, I will get another error then:
Error: theme_rect is deprecated. Use ‘element_rect’ instead. (Defunct; last used in version 0.9.1)
Thanks!
Bettina
From Geraldine_VdAuwera on 2014-07-06
Hi @Bettina_Harr,
There are a few other functions related to the opts -> theme change that also need to be adapted. The error message tells you which ones. Just step through the script and change every call that throws an error in the same way.
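If you’d rather do the substitutions in one pass, something like the following works (a sketch run on a stand-in file; `plots_demo.R` is just a one-line demo mimicking the deprecated 0.9.x-style calls, and you would point sed at `7.recalibrate_SNP_plots.R` instead; keep a backup of the script first):

```shell
# Demo input containing the deprecated ggplot2 calls
printf 'p + opts(title="x", panel.background=theme_rect(fill="white"))\n' > plots_demo.R

# Known 0.9.x -> 1.0 renames: opts->theme, theme_rect->element_rect,
# theme_text->element_text, theme_blank->element_blank
sed -i -e 's/\bopts(/theme(/g' \
       -e 's/\btheme_rect(/element_rect(/g' \
       -e 's/\btheme_text(/element_text(/g' \
       -e 's/\btheme_blank(/element_blank(/g' plots_demo.R

cat plots_demo.R
```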