created by cattle
on 2017-05-11
I tried to process the ERR000589 data with BwaSpark. The BAM file size is 1.3 GB. The average run time is about 25 minutes (5 nodes).
However, processing the same data with the original C bwa using 32 threads takes only about 5 minutes.
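For comparison, the single-machine baseline probably looked something like the following plain `bwa mem` run (paths and read files here are placeholders, not the actual ones used; `-t` is bwa's thread-count option):

```
# Hypothetical baseline: plain C bwa mem with 32 threads (paths are placeholders)
bwa mem -t 32 /path/to/ucsc.hg19.fasta ERR000589.fastq > ERR000589.sam
```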
Based on this observation, I have several questions, listed as follows:
1. Is there anything wrong with my parameters?
2. For each partition, does BwaSpark run in multi-threaded mode?
3. How can I control the number of bwa threads inside BwaSpark?
P.S.
The running command is:
./gatk-launch BwaSpark -I hdfs:///user/XX/ERR000589/ERR000589.bam -O hdfs:///user/XX/ERR000589/ERR000589_bwa.bam -R hdfs:///user/xx/refs/ucsc.hg19.fasta --bwamemIndexImage ~/data/ref/ucsc.hg19.img -disableSequenceDictionaryValidation true -- --sparkRunner SPARK --sparkMaster --executor-cores 1 --total-executor-cores 16 --executor-memory 4G
I further tried adjusting the following parameters:
--executor-cores --total-executor-cores --executor-memory --driver-memory
but none of these configurations brought the run time below 16 minutes.
Besides, I also tried running it in local mode, but it would not finish successfully. The CPU seemed to be stuck waiting indefinitely. I suspect it consumed so much memory that swap space came into use. Pic 1 shows the memory consumed while running.
This time, the command is:
./gatk-launch BwaSpark -I hdfs:///user/XX/ERR000589/ERR000589.bam -O hdfs:///user/xx/ERR000589/ERR000589.bwa.bam -R /software/home/xx/data/ref/ucsc.hg19.fasta --bwamemIndexImage ~/data/ref/ucsc.hg19.img -disableSequenceDictionaryValidation true -- --sparkRunner SPARK --sparkMaster local[*] --total-executor-cores 8 --executor-memory 20G --driver-memory 30G
BTW, the testing environment is:
CPU: 2 × 8 physical cores
nodes: 5
network: GbE
memory: 64G
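Given those specs, here is a rough executor-sizing sketch based on common Spark rules of thumb (not official GATK guidance; the per-executor core count and overhead fraction are assumptions):

```python
# Rough Spark executor sizing for 5 nodes x 16 cores x 64 GB each.
# Rules of thumb (assumptions, not GATK guidance):
#   - ~5 cores per executor to limit HDFS client contention
#   - leave 1 core per node for the OS/daemons
#   - reserve ~7% of executor memory for overhead

NODES = 5
CORES_PER_NODE = 16      # 2 sockets x 8 physical cores
MEM_PER_NODE_GB = 64

CORES_PER_EXECUTOR = 5
usable_cores = CORES_PER_NODE - 1                          # 15
executors_per_node = usable_cores // CORES_PER_EXECUTOR    # 3
total_executors = NODES * executors_per_node - 1           # 14, one slot left for the driver
heap_per_executor_gb = int(MEM_PER_NODE_GB // executors_per_node * 0.93)  # ~19

print(f"--executor-cores {CORES_PER_EXECUTOR} "
      f"--total-executor-cores {total_executors * CORES_PER_EXECUTOR} "
      f"--executor-memory {heap_per_executor_gb}G")
```

This suggests far more than the 16 total cores used in the original command, which may explain some of the gap against 32-thread C bwa.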
From Sheila on 2017-05-15
@cattle
Hi,
I just moved your discussion to the GATK4 category where someone else will help.
-Sheila
From davidwb on 2017-05-17
@cattle I was asking myself the same question today. Looking at the GATK4 source, it appears BwaSpark runs bwa with a single thread. I suspected this after seeing very low CPU usage on the worker nodes in the cluster. I hope they add a --threads option to the BwaSpark tool soon.
From shlee on 2017-05-17
Hi cattle and davidwb,
Thanks for your interest in our tools. BwaSpark is still under development and we hope to have answers for you in a month or two. Our GATK4 developers are just swamped right now.
From davidwb on 2017-05-17
@shlee No problem, just excited for the spark enabled tools to come out.
From Vzzarr on 2018-02-17
Are there any updates on Spark cluster parameter configuration? I am struggling to find the right settings to run BwaAndMarkDuplicatesPipelineSpark, for example; it always gets stuck after some hours of execution.
From Sheila on 2018-02-20
@Vzzarr
Hi,
I hope [this article](https://software.broadinstitute.org/gatk/documentation/article?id=11245) will help.
-Sheila
From Vzzarr on 2018-02-20
Thanks for your reply @Sheila ,
but, as described in the original question, I am interested in Spark parameters like `--executor-memory --driver-memory` and so on…
the execution always gets stuck when run in cluster mode, and I can't reach the end of the run
From Sheila on 2018-02-26
@Vzzarr
Hi,
I am afraid Soo Hee’s answer from [here](https://gatkforums.broadinstitute.org/gatk/discussion/comment/43894#Comment_43894) still applies. I know there is an effort to better document the Spark tools, but we just have not gotten to it yet. I will see if the team can provide an individual answer for you soon though.
-Sheila