created by GATK_Team
on 2018-01-19
In a nutshell, Spark is a piece of software that GATK4 uses to do multithreading, which is a form of parallelization that allows a computer (or cluster of computers) to finish executing a task sooner. You can read more about multithreading and parallelism in GATK here. The Spark software library is open-source and maintained by the Apache Software Foundation. It is very widely used in the computing industry and is one of the most promising technologies for accelerating execution of analysis pipelines.
Tools that can use Spark generally have a note to that effect in their respective Tool Doc.
- Some GATK tools exist in distinct Spark-capable and non-Spark-capable versions
The "sparkified" versions have the suffix "Spark" at the end of their names. Many of these are still experimental; down the road we plan to consolidate them so that there will be only one version per tool.
- Some GATK tools only exist in a Spark-capable version
Those tools don't have the "Spark" suffix.
If you're working on a "normal" machine (even just a laptop) with multiple CPU cores, the GATK engine can still use Spark to create a virtual standalone cluster in place, and set it to take advantage of however many cores are available on the machine -- or however many you choose to allocate. See the example parameters below and the local-Spark tutorial for more information on how to control this. And if your machine only has a single core, these tools can always be run in single-core mode -- it'll just take longer for them to finish.
To be clear, even the Spark-only tools can be run on regular machines, though in practice a few of them may be prohibitively slow (SV tools and PathSeq). See the Tool Docs for tool-specific recommendations.
If you do have access to a Spark cluster, the Spark-enabled tools are going to be extra happy but you may need to provide some additional parameters to use them effectively. See the cluster-Spark tutorial for more information.
Example command-line parameters
Here are some example arguments you would give to a Spark-enabled GATK tool:
--spark-master local[*]
-> "Run on the local machine using all cores"

--spark-master local[2]
-> "Run on the local machine using two cores"

--spark-master spark://23.195.26.187:7077
-> "Run on the cluster at 23.195.26.187, port 7077"

--spark-runner GCS --cluster my_cluster
-> "Run on my_cluster in Google Dataproc"

All the necessary software for using Spark, whether it's on a local machine or a Spark cluster, is bundled within the GATK itself. Just make sure to invoke GATK using the gatk wrapper script rather than calling the jar directly, because the wrapper will select the appropriate jar file (there are two!) and will set some parameters for you.
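To make this concrete, here is a minimal sketch of a full invocation for the local case and for a Google Dataproc cluster. The tool choice and file names are illustrative; in the second example the Spark-runner arguments go after a -- separator so the gatk wrapper can pick them up:

gatk MarkDuplicatesSpark \
    -I input.bam \
    -O marked_duplicates.bam \
    --spark-master local[4]

gatk MarkDuplicatesSpark \
    -I gs://my-bucket/input.bam \
    -O gs://my-bucket/marked_duplicates.bam \
    -- \
    --spark-runner GCS --cluster my_cluster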
Updated on 2018-06-26
From SkyWarrior on 2018-01-22
Hi,
--sparkMaster needs to be replaced with --spark-master for GATK 4.0.0.0
From Yingya on 2018-01-25
how to run PathSeqPipelineSpark on the local machine using 8 cores?
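Per the parameter examples in the article above, a sketch of such an invocation (the bracketed placeholder stands in for the tool's usual input/output arguments, which are omitted here):

gatk PathSeqPipelineSpark \
    [usual PathSeq input/output arguments] \
    --spark-master local[8]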
From DBPZ on 2018-05-31
I’m not sure if something has changed after GATK-4.0.0.0:
1) Spark is no longer included in the GATK package; it requires the “spark-submit” command to be in $PATH.
2) The argument names have changed: it is now “--spark-runner SPARK --spark-master $SPARK_URL”.
3) The “spark-submit” command failed because the string (the “-D” options to java) for the “extraJavaOptions” argument was not quoted. I wrote a wrapper to add quotes around these “-D” options (see the sketch below).
4) The “--spark-master” argument was given to the command (for example, SortSam) submitted to Spark, causing an error. I removed it in the wrapper.
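To illustrate the quoting issue in point 3, here is a sketch with illustrative -D options. Unquoted, the shell splits the value at the space and spark-submit sees the second -D option as a stray argument; quoting the whole conf value keeps it intact:

# Breaks: the shell splits the value at the space
spark-submit --conf spark.driver.extraJavaOptions=-Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.compression_level=2 ...

# Works: the whole key=value string reaches Spark as one argument
spark-submit --conf 'spark.driver.extraJavaOptions=-Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.compression_level=2' ...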
From Sheila on 2018-06-07
@DBPZ
Hi,
Are you saying these changes happened after the first official release of GATK4 or after the beta releases? The team is actively developing the Spark tools and they are in beta, so some changes are to be expected.
Thanks,
Sheila
From manolis on 2018-06-08
Hi @Sheila,
do you have any news about this tutorial ([Spark – How To](https://software.broadinstitute.org/gatk/documentation/article?id=11243 “How To”))?
All the best
From Sheila on 2018-06-18
@manolis
Hi,
I don’t have any news. I think the team is holding off on producing documentation for Spark until the Spark tools are out of beta. Perhaps [this Dictionary entry](https://software.broadinstitute.org/gatk/documentation/article?id=11245) will help a bit.
-Sheila
From KlausNZ on 2018-10-28
“- Some GATK tools only exist in a Spark-capable version
Those tools don’t have the “Spark” suffix.”
This is confusing (if true): is there a way to distinguish Spark-capable tools without the “Spark” suffix from Spark-incapable tools (also without the suffix)?
From oskarv on 2018-12-13
Is there a timeline for when you plan to release the first stable spark versions of BaseRecalibrator, ApplyBQSR and HaplotypeCaller?
From shlee on 2018-12-13
Hi @oskarv,
This is hard to say. Others have asked similarly [here](https://gatkforums.broadinstitute.org/gatk/discussion/11243/how-to-run-spark-enabled-gatk-tools-on-a-local-multi-core-machine) and [here](https://gatkforums.broadinstitute.org/gatk/discussion/11244/how-to-run-spark-enabled-gatk-tools-on-a-spark-cluster). My guess is it will be a while. In the meantime, we hope you do try out the BETA versions of these Spark implementations and let us know what you think. It’s feedback from researchers like yourself that really helps drive the development of our tools forward.
From oskarv on 2019-01-07
@shlee
Ok, that’s good to know. We are looking to go into production, and because the Spark versions are not recommended for that, we will probably move forward with the non-Spark versions until the Spark versions are out of beta.