created by cattle
on 2017-05-11
I tried to process the ERR000589 data with BwaSpark. The BAM file size is 1.3 GB. The average run time is about 25 minutes (5 nodes).
However, processing the same data with the original C bwa using 32 threads takes only about 5 minutes.
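For comparison, the single-machine baseline probably looked something like the following plain `bwa mem` run (paths and read files here are placeholders, not the actual ones used; `-t` is bwa's thread-count option):

```
# Hypothetical baseline: plain C bwa mem with 32 threads (paths are placeholders)
bwa mem -t 32 /path/to/ucsc.hg19.fasta ERR000589.fastq > ERR000589.sam
```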
Based on this observation, I have several questions, listed as follows:
1. Is there anything wrong with my parameters?
2. For each partition, does BwaSpark run in multi-threaded mode?
3. How can I control the number of bwa threads inside BwaSpark?
P.S.
The running command is:
./gatk-launch BwaSpark -I hdfs:///user/XX/ERR000589/ERR000589.bam -O hdfs:///user/XX/ERR000589/ERR000589_bwa.bam -R hdfs:///user/xx/refs/ucsc.hg19.fasta --bwamemIndexImage ~/data/ref/ucsc.hg19.img -disableSequenceDictionaryValidation true -- --sparkRunner SPARK --sparkMaster --executor-cores 1 --total-executor-cores 16 --executor-memory 4G
I further tried adjusting the following parameters:
--executor-cores --total-executor-cores --executor-memory --driver-memory
but none of these configurations brought the run time below 16 minutes.
Besides, I also tried running it in local mode, but it would not finish successfully. The CPU seemed to be stuck waiting indefinitely. I suspect it consumed so much memory that swap space came into use. Pic 1 shows the memory consumed while running.
This time, the command is:
./gatk-launch BwaSpark -I hdfs:///user/XX/ERR000589/ERR000589.bam -O hdfs:///user/xx/ERR000589/ERR000589.bwa.bam -R /software/home/xx/data/ref/ucsc.hg19.fasta --bwamemIndexImage ~/data/ref/ucsc.hg19.img -disableSequenceDictionaryValidation true -- --sparkRunner SPARK --sparkMaster local[*] --total-executor-cores 8 --executor-memory 20G --driver-memory 30G
BTW, the testing environment is:
CPU: 2 × 8 physical cores
nodes: 5
network: GbE
memory: 64G
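Given those specs, here is a rough executor-sizing sketch based on common Spark rules of thumb (not official GATK guidance; the per-executor core count and overhead fraction are assumptions):

```python
# Rough Spark executor sizing for 5 nodes x 16 cores x 64 GB each.
# Rules of thumb (assumptions, not GATK guidance):
#   - ~5 cores per executor to limit HDFS client contention
#   - leave 1 core per node for the OS/daemons
#   - reserve ~7% of executor memory for overhead

NODES = 5
CORES_PER_NODE = 16      # 2 sockets x 8 physical cores
MEM_PER_NODE_GB = 64

CORES_PER_EXECUTOR = 5
usable_cores = CORES_PER_NODE - 1                          # 15
executors_per_node = usable_cores // CORES_PER_EXECUTOR    # 3
total_executors = NODES * executors_per_node - 1           # 14, one slot left for the driver
heap_per_executor_gb = int(MEM_PER_NODE_GB // executors_per_node * 0.93)  # ~19

print(f"--executor-cores {CORES_PER_EXECUTOR} "
      f"--total-executor-cores {total_executors * CORES_PER_EXECUTOR} "
      f"--executor-memory {heap_per_executor_gb}G")
```

This suggests far more than the 16 total cores used in the original command, which may explain some of the gap against 32-thread C bwa.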
From Sheila on 2017-05-15
@cattle
Hi,
I just moved your discussion to the GATK4 category where someone else will help.
-Sheila
From davidwb on 2017-05-17
@cattle I was asking myself the same question today. Looking at the GATK4 source, it appears BwaSpark runs bwa with a single thread. I suspected this after seeing very low CPU usage on the worker nodes in the cluster. I hope they add a --threads option to the BwaSpark tool soon.
From shlee on 2017-05-17
Hi cattle and davidwb,
Thanks for your interest in our tools. BwaSpark is still under development and we hope to have answers for you in a month or two. Our GATK4 developers are just swamped right now.
From davidwb on 2017-05-17
@shlee No problem, just excited for the spark enabled tools to come out.
From Vzzarr on 2018-02-17
Are there any updates on Spark cluster parameter configuration? I am struggling to find the right settings to run BwaAndMarkDuplicatesPipelineSpark, for example; it always gets stuck after some hours of execution.
From Sheila on 2018-02-20
@Vzzarr
Hi,
I hope [this article](https://software.broadinstitute.org/gatk/documentation/article?id=11245) will help.
-Sheila
From Vzzarr on 2018-02-20
Thanks for your reply @Sheila ,
but, as described in the original question, I am interested in Spark parameters like `--executor-memory --driver-memory` and so on…
the execution always gets stuck when run in cluster mode, and I can't reach the end of the run
From Sheila on 2018-02-26
@Vzzarr
Hi,
I am afraid Soo Hee’s answer from [here](https://gatkforums.broadinstitute.org/gatk/discussion/comment/43894#Comment_43894) still applies. I know there is an effort to better document the Spark tools, but we just have not gotten to it yet. I will see if the team can provide an individual answer for you soon though.
-Sheila