created by Geraldine_VdAuwera
on 2012-12-14
This document provides technical details and recommendations on how the parallelism options offered by the GATK can be used to yield optimal performance results.
As explained in the primer on parallelism for the GATK, there are two main kinds of parallelism that can be applied to the GATK: multi-threading and scatter-gather (using Queue or Cromwell/WDL).
There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively, which can be combined:
- -nt / --num_threads controls the number of data threads sent to the processor
- -nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread

For more information on how these multi-threading options work, please read the primer on parallelism for the GATK.
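As a rough illustration of how the two arguments combine, the sketch below assembles a GATK 3-style command line that uses both options and reports the resulting total thread count (data threads times CPU threads per data thread). The helper name `build_gatk_command` and the file names `reference.fasta`, `input.bam` and `output.vcf` are hypothetical placeholders, and UnifiedGenotyper is used only because this document lists it as supporting both modes.

```python
def build_gatk_command(tool, nt=1, nct=1, jar="GenomeAnalysisTK.jar"):
    """Return (argv, total_threads) for a GATK 3-style invocation.

    All file names are hypothetical placeholders; adjust them to your data.
    """
    argv = [
        "java", "-jar", jar,
        "-T", tool,
        "-R", "reference.fasta",  # hypothetical reference
        "-I", "input.bam",        # hypothetical input
        "-o", "output.vcf",       # hypothetical output
    ]
    if nt > 1:
        argv += ["-nt", str(nt)]    # number of data threads
    if nct > 1:
        argv += ["-nct", str(nct)]  # CPU threads per data thread
    return argv, nt * nct


argv, total = build_gatk_command("UnifiedGenotyper", nt=4, nct=6)
print(" ".join(argv))
print("total threads:", total)
```

The command is only printed, not executed, so the sketch can be tried without a GATK installation.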
Memory considerations for multi-threading
Each data thread needs to be given the full amount of memory you’d normally give a single run. So if you’re running a tool that normally requires 2 GB of memory and you use -nt 4, the multithreaded run will use 8 GB of memory. In contrast, CPU threads will share the memory allocated to their “mother” data thread, so you don’t need to worry about allocating memory based on the number of CPU threads you use.
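The memory rule above is simple arithmetic, sketched here as a small helper. The function name `required_heap_gb` is hypothetical, and the result is only the estimate described in the text, not a measured requirement.

```python
def required_heap_gb(single_run_gb, nt=1, nct=1):
    """Estimate the memory to allocate for a multithreaded run.

    Each data thread (-nt) needs the full single-run amount. CPU threads
    (-nct) share their data thread's memory, so nct does not affect the total.
    """
    if nt < 1 or nct < 1:
        raise ValueError("nt and nct must be at least 1")
    return single_run_gb * nt


print(required_heap_gb(2, nt=4))         # the 2 GB, -nt 4 example from the text
print(required_heap_gb(2, nt=1, nct=8))  # CPU threads alone do not add memory
```

The value returned is what you would typically pass to the JVM as the maximum heap size (for example via `-Xmx`).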
Additional consideration when using -nct with versions 2.2 and 2.3
Because of the way the -nct option was originally implemented, in versions 2.2 and 2.3, there is one CPU thread that is reserved by the system to “manage” the rest. So if you use -nct, you’ll only really start seeing a speedup with -nct 3 (which yields two effective "working" threads) and above. This limitation has been resolved in the implementation that will be available in versions 2.4 and up.
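To make the version-specific behaviour concrete, here is a minimal sketch of the thread accounting described above. The function name is hypothetical, and the handling of `-nct 1` (treated as one working thread, since no extra threads are requested) is an assumption rather than something stated in the text.

```python
import re


def effective_working_threads(nct, gatk_version):
    """Approximate how many CPU threads do actual work for a given -nct.

    Assumption: in 2.2 and 2.3 one thread is reserved as a manager whenever
    nct > 1; in all other versions every requested thread does work.
    """
    match = re.match(r"(\d+)\.(\d+)", gatk_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {gatk_version!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    if major_minor in {(2, 2), (2, 3)} and nct > 1:
        return nct - 1
    return nct


for nct in (1, 2, 3, 4):
    print(nct,
          effective_working_threads(nct, "2.3"),
          effective_working_threads(nct, "2.4"))
```

Under this accounting, `-nct 2` on version 2.2 or 2.3 leaves a single working thread, which is why a speedup only appears from `-nct 3`.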
For more details on scatter-gather, see the primer on parallelism for the GATK and the documentation on pipelining options.
Please note that not all tools support all parallelization modes. The parallelization modes that are available for each tool depend partly on the type of traversal that the tool uses to walk through the data, and partly on the nature of the analyses it performs.
| Tool | Full name | Type of traversal | NT | NCT | SG |
|:------|:-----------|:---------------------|:---:|:-----:|:----:|
| RTC | RealignerTargetCreator | RodWalker | + | - | - |
| IR | IndelRealigner | ReadWalker | - | - | + |
| BR | BaseRecalibrator | LocusWalker | - | + | + |
| PR | PrintReads | ReadWalker | - | + | - |
| RR | ReduceReads | ReadWalker | - | - | + |
| HC | HaplotypeCaller | ActiveRegionWalker | - | (+) | + |
| UG | UnifiedGenotyper | LocusWalker | + | + | + |
Note that while HaplotypeCaller supports -nct in principle, many have reported that it is not very stable (random crashes may occur -- but if there is no crash, results will be correct). We prefer not to use this option with HC; use it at your own risk.
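If you script your runs, the support table can be encoded as a lookup so that an unsupported option is caught before a job is submitted. This is only a sketch: the dictionary is transcribed from the table above, and the names `SUPPORTED_MODES`, `UNSTABLE` and `check_mode` are hypothetical.

```python
# Parallelization modes per tool, transcribed from the table above.
SUPPORTED_MODES = {
    "RealignerTargetCreator": {"NT"},
    "IndelRealigner": {"SG"},
    "BaseRecalibrator": {"NCT", "SG"},
    "PrintReads": {"NCT"},
    "ReduceReads": {"SG"},
    "HaplotypeCaller": {"NCT", "SG"},
    "UnifiedGenotyper": {"NT", "NCT", "SG"},
}

# Marked "(+)" in the table: supported in principle but reported as unstable.
UNSTABLE = {("HaplotypeCaller", "NCT")}


def check_mode(tool, mode):
    """Return 'supported', 'unstable' or 'unsupported' for a tool/mode pair."""
    if tool not in SUPPORTED_MODES:
        raise KeyError(f"tool not listed in the table: {tool}")
    if mode not in SUPPORTED_MODES[tool]:
        return "unsupported"
    if (tool, mode) in UNSTABLE:
        return "unstable"
    return "supported"


print(check_mode("UnifiedGenotyper", "NT"))
print(check_mode("HaplotypeCaller", "NCT"))
print(check_mode("IndelRealigner", "NT"))
```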
The table below summarizes configurations that we typically use for our own projects (one per tool, except we give three alternate possibilities for the UnifiedGenotyper). The different values allocated for each tool reflect not only the technical capabilities of these tools (which options are supported), but also our empirical observations of what provides the best tradeoffs between performance gains and commitment of resources. Please note however that this is meant only as a guide, and that we cannot give you any guarantee that these configurations are the best for your own setup. You will probably have to experiment with the settings to find the configuration that is right for you.
| Tool | RTC | IR | BR | PR | RR | HC | UG |
|:-----|:----:|:---:|:---:|:---:|:---:|:---:|:---:|
| Available modes | NT | SG | NCT,SG | NCT | SG | NCT,SG | NT,NCT,SG |
| Cluster nodes | 1 | 4 | 4 | 1 | 4 | 4 | 4 / 4 / 4 |
| CPU threads (-nct) | 1 | 1 | 8 | 4-8 | 1 | 4 | 3 / 6 / 24 |
| Data threads (-nt) | 24 | 1 | 1 | 1 | 1 | 1 | 8 / 4 / 1 |
| Memory (GB) | 48 | 4 | 4 | 4 | 4 | 16 | 32 / 16 / 4 |
Where NT is data multithreading, NCT is CPU multithreading and SG is scatter-gather using Queue or another data parallelization framework. For more details on scatter-gather, see the primer on parallelism for the GATK and the documentation on pipelining options.
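The three UnifiedGenotyper alternatives in the table can be checked against the memory rule given earlier. The sketch below assumes the memory figure applies to one job (one node); that reading is an assumption, as the table does not say so explicitly. The names `UG_CONFIGS` and `summarize` are hypothetical.

```python
# UnifiedGenotyper alternatives from the table: (nodes, nct, nt, memory_gb).
UG_CONFIGS = [
    (4, 3, 8, 32),
    (4, 6, 4, 16),
    (4, 24, 1, 4),
]


def summarize(config):
    """Return (threads_per_job, memory_per_data_thread_gb) for one config."""
    nodes, nct, nt, memory_gb = config
    return nt * nct, memory_gb / nt


for config in UG_CONFIGS:
    threads, per_data_thread = summarize(config)
    print(config, "->", threads, "threads,", per_data_thread, "GB per data thread")
```

Under that assumption, all three alternatives use 24 threads per job and 4 GB per data thread; they differ only in how the threads are split between -nt and -nct, and therefore in total memory.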
Updated on 2016-07-20
From Geraldine_VdAuwera on 2014-09-01
Questions and comments up to August 2014 have been moved to an archival thread here:
http://gatkforums.broadinstitute.org/discussion/4560/questions-about-multithreading-parallelism
From jacobhsu on 2014-09-15
Sorry, I have to post here in order to make it clearer. I guess I’m a bit confused. Does the -nt parameter mean the same thing as the number of nodes (machines)? From the information above, did you get the balanced results with 24 nodes (machines) for the RTC tool?
From Geraldine_VdAuwera on 2014-09-22
@jacobhsu That’s correct.
From tommycarstensen on 2014-11-21
Any recommended configurations for HaplotypeCaller, CombineGVCFs and GenotypeGVCFs?
From Geraldine_VdAuwera on 2014-11-21
Not really, to be honest. I’ve tried to get the engineers to outline some recommendations but they are very reluctant to spit out any numbers. I will try again (it’s not stalking if it’s part of your job) but in the meantime I would say that trial and error (and lots of systematic testing) is your best bet.
From intipedroso on 2015-04-13
Hi,
I am running SplitNCigarReads with --num_threads 1 --num_cpu_threads_per_data_thread 1. I wanted to use 1 CPU and no more. However, as you can see on the line below, sometimes it uses 40 CPUs or more. Why does this happen and how can I actually restrict the CPU usage to 1?
```
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
34701 ipedroso 20 0 38,799g 1,104g 12648 S 4037 0,2 7:58.99 java -Djava.io.tmpdir=./ -jar /home/ipedroso/APP/GenomeAnalysisTK.jar --num_threads 1 --num_cpu_threads_per_data_thread 1 -T SplitNCigarReads
```
Thanks in advance
From Geraldine_VdAuwera on 2015-04-15
@intipedroso This is well outside my scope, but I think I read somewhere that the JVM itself will utilize additional cores even if the application does not request them, so you may need to figure out how to constrain CPU usage by the JVM.
From Kurt on 2015-04-15
This comment from the Picard FAQ may be useful (I’ve never had any interest in playing with it myself). You should be able to set it when invoking java (e.g. java -jar -XX:ParallelGCThreads=1). I used to see this with some Picard programs when I looked at these things a few years back (it seemed to me that it would spike when trying to write to file, but I could be wrong).
http://broadinstitute.github.io/picard/faq.html
Q: Why does a Picard program use so many threads?
A: This can be caused by the GC method of Java when used on 64-bit Java. By default the JVM switches to ‘server’ settings when on 64-bit; this automatically implements parallel GC and will use as many cores as it can get its hands on. The approach we decided on to get round this was to define the number of threads we would allow Java for GC.
-XX:ParallelGCThreads=
An alternative approach is to turn off parallel GC by enabling the serial collector (a boolean option, so note the ‘+’ to indicate it is turned on):
-XX:+UseSerialGC
We found this to be sub-optimal as the process has to stop completely when GC occurs, and it takes much longer because (from what I can tell) a full GC sweep is the only type performed, which in many cases is not required (parallel GC employs ~7 different types of GC). See here for further details of the tuneable parameters.
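For readers who script their runs, a small sketch of how the two JVM options quoted above could be placed on the command line. The helper name `java_command` is hypothetical, the tool arguments are the ones from the question above with the other required arguments left out, and whether either option actually limits CPU usage on a given system is something to verify yourself.

```python
def java_command(jar, tool_args, gc_threads=None, serial_gc=False):
    """Build a java command line with the GC option placed before -jar."""
    argv = ["java"]
    if serial_gc:
        argv.append("-XX:+UseSerialGC")  # use the serial collector
    elif gc_threads is not None:
        argv.append(f"-XX:ParallelGCThreads={gc_threads}")  # cap GC threads
    argv += ["-jar", jar]
    argv += list(tool_args)
    return argv


# Other required GATK arguments (reference, input, output) are omitted here.
tool_args = ["-T", "SplitNCigarReads",
             "--num_threads", "1",
             "--num_cpu_threads_per_data_thread", "1"]

print(" ".join(java_command("GenomeAnalysisTK.jar", tool_args, gc_threads=1)))
print(" ".join(java_command("GenomeAnalysisTK.jar", tool_args, serial_gc=True)))
```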
From WANGxiaoji on 2015-05-04
I got a confusing log after I set the “-nt” parameter to 24 when using RealignerTargetCreator (GATK v3.2-0-g289df4b):
….
INFO 16:45:20,577 MicroScheduler – Running the GATK in parallel mode with 24 total threads, 1 CPU thread(s) for each of 24 data thread(s), of 48 processors available on this machine
INFO 16:45:20,690 GenomeAnalysisEngine – Preparing for traversal over 1 BAM files
….
INFO 16:45:22,336 SAMDataSource$SAMReaders – Done initializing BAM readers: total time 0.01
INFO 16:45:22,337 SAMDataSource$SAMReaders – Initializing SAMRecords in serial
INFO 16:45:22,346 SAMDataSource$SAMReaders – Done initializing BAM readers: total time 0.01
INFO 16:45:22,347 SAMDataSource$SAMReaders – Initializing SAMRecords in serial
INFO 16:45:22,355 SAMDataSource$SAMReaders – Done initializing BAM readers: total time 0.01
INFO 16:45:22,356 SAMDataSource$SAMReaders – Initializing SAMRecords in serial
INFO 16:45:22,374 SAMDataSource$SAMReaders – Done initializing BAM readers: total time 0.02
INFO 16:45:22,375 SAMDataSource$SAMReaders – Initializing SAMRecords in serial
INFO 16:45:22,383 SAMDataSource$SAMReaders – Done initializing BAM readers: total time 0.01
INFO 16:46:11,404 ProgressMeter – 22:337201 0.0 50.0 s 83.1 w 91.2% 54.0 s 4.0 s
INFO 16:46:44,124 ProgressMeter – X:2549001 0.0 83.0 s 137.2 w 93.0% 89.0 s 6.0 s
INFO 16:47:14,127 ProgressMeter – X:5268801 0.0 113.0 s 186.8 w 93.1% 2.0 m 8.0 s
INFO 16:47:44,130 ProgressMeter – Y:2454001 8028160.0 2.4 m 17.0 s 98.0% 2.4 m 2.0 s
INFO 16:48:14,151 ProgressMeter – Y:6514901 8028160.0 2.9 m 21.0 s 98.1% 2.9 m 3.0 s
INFO 16:48:44,254 ProgressMeter – Y:10407301 8028160.0 3.4 m 25.0 s 98.2% 3.4 m 3.0 s
INFO 16:49:14,257 ProgressMeter – Y:15153801 8028160.0 3.9 m 29.0 s 98.4% 3.9 m 3.0 s
INFO 16:49:44,260 ProgressMeter – Y:19590301 8028160.0 4.4 m 32.0 s 98.5% 4.4 m 3.0 s
INFO 16:50:14,262 ProgressMeter – Y:21872001 8028160.0 4.9 m 36.0 s 98.6% 5.0 m 4.0 s
INFO 16:50:44,267 ProgressMeter – Y:26610901 8028160.0 5.4 m 40.0 s 98.7% 5.5 m 4.0 s
INFO 16:51:14,270 ProgressMeter – Y:29650601 8028160.0 5.9 m 43.0 s 98.8% 6.0 m 4.0 s
INFO 16:51:44,273 ProgressMeter – Y:35104501 8028160.0 6.4 m 47.0 s 99.0% 6.4 m 3.0 s
INFO 16:52:14,277 ProgressMeter – Y:39690601 8028160.0 6.9 m 51.0 s 99.2% 6.9 m 3.0 s
INFO 16:52:44,399 ProgressMeter – GL000221.1:113001 6.120736E7 7.4 m 7.0 s 99.9% 7.4 m 0.0 s
INFO 16:53:14,465 ProgressMeter – GL000192.1:547401 1.13589948E8 7.9 m 4.0 s 100.0% 7.9 m 0.0 s
….
I didn’t assign any interval (-L) to change the start point, but according to this log it seems that RealignerTargetCreator started to handle my task from chromosome 22. Actually, the input BAM file only contains some reads mapped to chromosome 3. Interestingly, I found that the start point of RealignerTargetCreator is determined by the “-nt” value when “-L” is absent. I’m using a small BAM file to check whether the result of indel realignment will be affected by the absence of an interval assigned with “-L”.
I hope someone can tell me that this confusing log does not indicate an error.
From Sheila on 2015-05-04
@WANGxiaoji
Hi,
The log is not really indicative of where the tool is working. It shows checkpoints. Because you are not giving intervals, the checkpoints include the entire genome. The main thing to check is the output target intervals file and make sure it contains intervals relevant to your input bam file.
-Sheila
From WANGxiaoji on 2015-05-07
Thanks for the reply, @Sheila.
I also ran more tests with RealignerTargetCreator over the past few days, and confirmed there is no error in the final BAM output. But I do hope GATK will correct this misleading log info in the near future.
From dhfx on 2015-05-27
I am running GATK 3.4.0-g7e26428 on a 64-bit 8-core 16-thread Linux system, using a test case of 50 million short reads on chr15 of the human genome to estimate the timing for HaplotypeCaller. Without the -nct option I get ~ 26 minutes; with -nct 8 it’s ~ 20 minutes. The Java version is OpenJDK 64-Bit Server VM 1.7.0_79-b14. I would expect more like a x8 speedup; why am I not seeing that?
HaplotypeCaller does not appear to accept the -nt option. Is there any way (besides farming out the individual chromosomes) to run multiple data threads?
Thanks in advance for any helpful advice.
From Geraldine_VdAuwera on 2015-06-01
@dhfx Answered in http://gatkforums.broadinstitute.org/discussion/5620/what-is-status-of-multithreading-in-gatk-3-4
Please don’t post the same question in multiple places as this generates extra work for us for nothing. We read and answer everything, regardless of where it’s posted.
From y4dar on 2016-02-25
Just wondering if there has been an update to the potential use of multiple processors with HaplotypeCaller. I just started a process and I’m seeing an estimate of 3.3 weeks for the job to finish. That seems excessive. I could be doing something very wrong but I’m new enough that I’m not sure.
From Geraldine_VdAuwera on 2016-02-26
@y4dar Multithreading isn’t really the way to go with HaplotypeCaller; beyond 4 to 8 threads performance can actually get worse than without multithreading (depending on your system). In our own work we use scatter-gather parallelism to accelerate processing.
From BobHarris on 2016-03-18
Am I correct in assuming that if I don’t use -nt or -nct in my GATK command, the program will run as a single thread? Or, is there some other default? Or does it depend on the specific command being run?
From Geraldine_VdAuwera on 2016-03-18
That’s correct, all GATK tools run single-threaded unless you explicitly specify otherwise. People sometimes see multiple threads running but those belong to the Java garbage collection functions, not GATK.
From BobHarris on 2016-03-18
Thanks @Geraldine_VdAuwera. I hadn’t thought about java creating its own threads. I’ll have to think about that, and consider whether I need to account for that in my cluster job submissions.
From shanshanren on 2016-05-19
> Geraldine_VdAuwera said:
> @y4dar Multithreading isn’t really the way to go with HaplotypeCaller; beyond 4 to 8 threads performance can actually get worse than without multithreading (depending on your system). In our own work we use scatter-gather parallelism to accelerate processing.
Hi,
Does “In our own work we use scatter-gather parallelism to accelerate processing” mean that when you use Queue to accelerate HaplotypeCaller, each node in a cluster runs HaplotypeCaller with only one thread?
Thanks a lot.
From Geraldine_VdAuwera on 2016-05-20
@shanshanren Yes that’s correct.
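As a hand-rolled stand-in for what Queue automates, the scatter step can be sketched as one single-threaded HaplotypeCaller command per contig, each restricted with -L. This is not how Queue itself is invoked. The helper name `scatter_commands`, the file names and the contig list are hypothetical placeholders, and the gather step (merging the per-contig VCFs) is left out.

```python
def scatter_commands(contigs, jar="GenomeAnalysisTK.jar"):
    """Return one single-threaded HaplotypeCaller command per contig."""
    commands = []
    for contig in contigs:
        commands.append([
            "java", "-jar", jar,
            "-T", "HaplotypeCaller",
            "-R", "reference.fasta",      # hypothetical reference
            "-I", "input.bam",            # hypothetical input
            "-L", contig,                 # restrict this job to one contig
            "-o", f"calls.{contig}.vcf",  # hypothetical per-contig output
        ])
    return commands


# Each command would be submitted as a separate job on its own node or core.
for argv in scatter_commands(["1", "2", "X"]):
    print(" ".join(argv))
```

No -nt or -nct appears in these commands: the parallelism comes entirely from running the jobs side by side.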
From flyingflyers on 2017-08-04
How about MuTect2? It doesn’t seem to support -nt but does work with -nct. I tried -nct 16 (72 h) and -nct 4 (60 h), both for a human exome. What are the recommendations to make it run within 1 day, ideally? Thanks.
From Sheila on 2017-08-16
@flyingflyers
Hi,
You can check the tool documentation to see which arguments the tools take.
We recommend trying GATK4 Mutect2, as it is much faster than GATK3 MuTect2.
For some basic guidelines on how long the tools take to run, have a look at [this white paper](https://www.intel.com/content/www/us/en/healthcare-it/solutions/documents/deploying-gatk-best-practices-paper.html).
-Sheila
From jianxinwang on 2018-02-12
For GATK4: I don’t see multi-threading options for HaplotypeCaller (or any other tools in this version). Was it removed, or are there better options? Another question: are there any pipelines available for running GATK on local HPCs using Slurm as the job scheduler? It seems to me the new GATK4 is heavily tailored to run on cloud platforms, is that right? The GATK+WDL+Cromwell workflow is for the cloud platform, not for local HPCs, right?
From Sheila on 2018-02-14
@jianxinwang
Hi,
There is no multi-threading in GATK4 tools, just the Spark versions of the tools. Have a look at [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/comment/45427#Comment_45427).
That is correct; the workflows are tailored to running on the cloud. However, if you search for “slurm” on the forum, you may find some interesting threads. In particular, [this thread](https://gatkforums.broadinstitute.org/wdl/discussion/9368/how-difficult-would-it-be-to-get-cromwell-working-on-slurm) and [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/10424/spark-in-other-clusters) may help.
-Sheila