Building workflow on Seven Bridges Genomics
Progress update on workflow building on Seven Bridges Genomics:
AWS EC2 instance types: https://aws.amazon.com/ec2/instance-types/
MuTect2 (x):
Grab SM from BAMs: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/sm-extractor
Mutect2 App: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/mutect2-1
FilterMutectCalls App: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/filtermutectcalls-1
MuTect2 Workflow: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/mutect2-workflow-1
Parallelized mutect2-workflow: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/parallel-mutect2-workflow
Test run completed successfully but parallelization was not as efficient as intended: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/77d3fde9-bc6e-4b12-86fd-2f0cfb1da78d/
Inefficient parallelization was due to resource request/limitation on the node. This one parallelized successfully and correctly: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/9411cf88-19fc-4393-9fb3-0862b684076a/
Scatter-and-gather successfully completed: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/717e1271-2eb6-499d-8c92-78865a3dd401/
Successful test run scatter-gather-sort: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/490ace9a-6294-485c-aaa7-9653d1ba0849
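For the sm-extractor step listed in this section, the following is a minimal sketch of how the SM (sample name) tag can be pulled from a BAM header. It is not the code of the app itself. It assumes the header text has already been obtained, for example with `samtools view -H`, and the function name and example header are made up for illustration.

```python
def extract_sample_names(header_text):
    """Return the distinct SM values found on @RG header lines, in order of appearance."""
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # @RG fields are tab-separated TAG:VALUE pairs.
        for field in line.split("\t")[1:]:
            if field.startswith("SM:"):
                sample = field[3:]
                if sample not in names:
                    names.append(sample)
    return names


# Made-up header, in the layout printed by `samtools view -H sample.bam`.
example_header = (
    "@HD\tVN:1.5\tSO:coordinate\n"
    "@RG\tID:lane1\tSM:TUMOR\tPL:ILLUMINA\n"
    "@RG\tID:lane2\tSM:TUMOR\tPL:ILLUMINA\n"
)
print(extract_sample_names(example_header))
```

A BAM with several read groups normally carries the same SM on each, so the sketch returns a list and leaves it to the caller to decide what to do if more than one name comes back.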
SomaticSniper (x):
SomaticSniper app: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/somaticsniper
SomaticSniper workflow (splits the results into regions based on the BED file input): https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/somaticsniper-workflow
SomaticSniper run split, i.e., run SomaticSniper on a single thread, then use BEDTools to split the result into region-wise VCF files, so that the output is organized by region like that of the other tools: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/somaticsniper-run-split
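The region-wise split described above is done with BEDTools in the app. As an illustration of the idea only, here is a pure-Python stand-in that assigns each VCF record to the BED region containing its POS. It assumes non-overlapping regions and looks only at POS (1-based) against BED intervals (0-based, half-open), which is simpler than what BEDTools does. All names are hypothetical.

```python
def split_vcf_by_regions(vcf_lines, regions):
    """Split VCF records by region.

    regions is a list of (chrom, start, end) BED intervals.
    Returns (header_lines, records_per_region), one record list per region.
    """
    header = [line for line in vcf_lines if line.startswith("#")]
    per_region = [[] for _ in regions]
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t")[:2]
        pos = int(pos)
        for i, (r_chrom, start, end) in enumerate(regions):
            # 1-based POS p lies in the 0-based half-open interval [start, end)
            # exactly when start < p <= end.
            if chrom == r_chrom and start < pos <= end:
                per_region[i].append(line)
                break
    return header, per_region


# Made-up records and regions.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tT",
    "chr1\t101\t.\tG\tC",
    "chr2\t5\t.\tC\tG",
]
example_regions = [("chr1", 0, 100), ("chr1", 100, 200), ("chr2", 0, 10)]
header, parts = split_vcf_by_regions(example_vcf, example_regions)
```

Each region-wise VCF would then be written as the shared header followed by that region's records.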
VarDict (x):
App to split the regions of a BED file into 5000 bp per line: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/bed-splitter
VarDictJava: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/vardictjava
VarDict's testsomatic: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/vardict-testsomatic
VarDict's var2vcf: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/vardict-var2vcf
VarDict Workflow: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/vardict-workflow
Successfully tested on the local rabix-executor, but had trouble on the cloud.
Successful test run on the cloud: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/e217516a-5d18-440c-8f56-17306b000df2/
Gave the tumor_bam and normal_bam inputs positions 99 and 100, and they no longer appear on the command line. Later also entered '' as the value for those inputs, which hopefully fixes it for the foreseeable future.
parallel-vardict-workflow: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/parallel-vardict-workflow
test run: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/7728ef60-0b11-46ee-85c6-23d130877228
Successful test of scatter-gather-sort: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/5f2f5b3e-87f7-42bc-90b8-6ff0bef77f7a
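The bed-splitter step at the top of the VarDict section cuts BED regions into 5000 bp pieces. A minimal sketch of that operation follows. It assumes tab-separated input, keeps only the first three columns, and lets the last piece of a region be shorter than 5000 bp. Those details and the function name are assumptions, not taken from the app.

```python
def split_bed_regions(bed_lines, max_len=5000):
    """Cut every BED region into consecutive pieces of at most max_len bp."""
    pieces = []
    for line in bed_lines:
        # Skip blank lines and the usual BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        for piece_start in range(start, end, max_len):
            piece_end = min(piece_start + max_len, end)
            pieces.append(f"{chrom}\t{piece_start}\t{piece_end}")
    return pieces


# A made-up 12000 bp region becomes pieces of 5000, 5000 and 2000 bp.
for piece in split_bed_regions(["chr1\t0\t12000\n"]):
    print(piece)
```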
MuSE (x):
Keep first 3 columns of bed file: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/trim-bed-to-3-columns
MuSE call command: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/muse-call-1
MuSE sump command: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/muse-sump-1
Introduced a hack to copy dbsnp.vcf.gz and its index, then touch the index so that it has a later timestamp.
https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/bd46cf45-1089-4951-9402-f8de8a3ad00b/
MuSE workflow: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/muse-workflow-1
Can execute successfully on local rabix-executor, but on the cloud MuSE is complaining about the dbsnp.vcf.gz's index, e.g.,
https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/b04b9489-2cd0-4d0f-8820-d0bb6b251265/
https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/ffe97765-f582-4355-a5bb-caa2b2e9b83a/. This parallelization seems to work properly, though the job failed for the same dbsnp.vcf.gz index issues.
To get around MuSE complaining about vcf.gz.idx being older than vcf.gz, I made muse-sump copy/link and touch the file. The workflow hack that succeeded: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/7fc84cc4-0bad-4b42-aace-17590cf842e0/
Successful test of scatter-gather-sort: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/e758b599-662b-4e4f-9f0b-b5cf091bba1f
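The copy-and-touch hack used for muse-sump can be sketched as below. This is an illustration under assumptions, not the app's actual command: the function name is hypothetical, and the index path is passed in explicitly rather than guessed from the VCF name.

```python
import os
import shutil
import time


def stage_with_fresh_index(vcf_path, index_path, dest_dir):
    """Copy a compressed VCF and its index, then make the index copy newer.

    Returns the paths of the two copies.
    """
    os.makedirs(dest_dir, exist_ok=True)
    vcf_copy = shutil.copy(vcf_path, dest_dir)
    index_copy = shutil.copy(index_path, dest_dir)
    # Equivalent of `touch` on the index: give it a modification time that is
    # at least one second later than that of the VCF copy.
    vcf_mtime = os.stat(vcf_copy).st_mtime
    newer = max(time.time(), vcf_mtime + 1)
    os.utime(index_copy, (newer, newer))
    return vcf_copy, index_copy
```

The one-second margin is there because some file systems store modification times with coarse resolution, so two files written back to back could otherwise end up with the same timestamp.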
Scalpel (x):
Complete Scalpel workflow: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/scalpel-1
parallel-scalpel-workflow:
Strelka (x):
strelka-config and strelka-run in one go: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/strelka-config-1
Successful Strelka run: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/3c3b7f61-665f-4add-9f37-bea0c6241c56/
Not yet tested: automatically bgzip and tabix-index the input BED file. Shouldn't be too difficult.
strelka-workflow (takes in a BED file and makes bed.gz): https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/strelka-workflow
parallel-strelka-workflow allows parallelization by region as well as through the -j N parameter. This is more flexible because it does not require a node with more cores than the number of threads specified: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/parallel-strelka-workflow
Successful test run: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/685fbb0d-aad4-410e-9aa0-2ce26ed3f816/
Results identical to the non-region-parallelized strelka job.
Successful test run of scatter-gather-sort: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/7983f49d-b75c-4f60-a653-5c2c68a0e897
Sentieon TNscope:
Copied workflow:
SomaticSeq.Wrapper.sh (x):
somaticseq-wrapper: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/somaticseq-wrapper-1
BED-parallelizer (x):
Split a BED file into X number of bed files of equal region size: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/bed-parallelizer
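One way to split a BED file into X pieces of equal total region size is sketched here. It is an assumption about the approach, not the code of the bed-parallelizer app: it walks the regions in order, fills each output to a target number of base pairs, and cuts a region in two when it crosses the boundary. Names are hypothetical.

```python
import math


def parallelize_bed(regions, n_parts):
    """Split (chrom, start, end) regions into n_parts lists of about equal total length."""
    total = sum(end - start for _, start, end in regions)
    target = math.ceil(total / n_parts)  # base pairs per output file
    parts = [[]]
    room = target  # base pairs still available in the current part
    for chrom, start, end in regions:
        while start < end:
            if room == 0:
                parts.append([])
                room = target
            take = min(room, end - start)
            parts[-1].append((chrom, start, start + take))
            start += take
            room -= take
    # With few base pairs the walk can finish early; pad so the count is n_parts.
    while len(parts) < n_parts:
        parts.append([])
    return parts


# Made-up input: 150 bp in total, split three ways gives 50 bp per part.
example_parts = parallelize_bed([("chr1", 0, 100), ("chr2", 0, 50)], 3)
for part in example_parts:
    print(part)
```

Because the target is rounded up, the last part can be smaller than the others, so the sizes are equal only up to rounding.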
App to vcf-concat the input VCF files in the sort -V order of their file names:
app: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/vcf-concat-sort
Tested with this parallel-mutect2-workflow: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/ac9e97f8-1f8e-43df-ac53-8d57b05d7a9a
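The ordering matters because region files named 1, 2, ..., 10 sort incorrectly as plain strings (10 comes before 2). The sketch below approximates the `sort -V` ordering with a natural-sort key and concatenates in that order, keeping the header of the first file only. It is not the app's implementation, the key does not reproduce every rule of GNU `sort -V`, and the file names are made up.

```python
import re


def version_sort_key(name):
    """Natural-sort key: runs of digits compare as numbers, other text as strings."""
    key = []
    for token in re.split(r"([0-9]+)", name):
        if not token:
            continue
        if "0" <= token[0] <= "9":
            key.append((0, int(token), ""))
        else:
            key.append((1, 0, token))
    return key


def concat_vcfs(named_vcfs):
    """Concatenate VCFs given as {file_name: list_of_lines}, in natural name order."""
    merged = []
    for i, name in enumerate(sorted(named_vcfs, key=version_sort_key)):
        for line in named_vcfs[name]:
            # Keep header lines from the first file only.
            if line.startswith("#") and i > 0:
                continue
            merged.append(line)
    return merged


example_names = ["10.vcf", "2.vcf", "1.vcf"]
print(sorted(example_names, key=version_sort_key))
```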
SomaticSeq workflow with MuTect2(M), SomaticSniper(S), and Strelka(K):
somaticseq-workflow-MSK: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/somaticseq-workflow-msk
Jobs are failing so far, but parallelization seems to be working: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/1fffae52-c6a5-4781-906d-49744a6c765c
It does not seem to be able to pass the array of VCF files from each tool to somaticseq-wrapper.
Multiple caller workflow with MuTect2(M), SomaticSniper(S), VarDict(D), and Strelka(K):
somaticCallers-MSDK: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/somaticcallers-msdk
Use bed-parallelizer to split input BED into multiple equal-sized BED files, then use those to parallelize.
At the end, use vcf-concat-sort to combine the VCF files from each region: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/vcf-concat-sort
Successful test runs: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/284b31c7-519e-4c59-8ee0-67c7122166a7
Running GATK CallableLoci:
Successful test job: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/dc205640-cfb8-4cd0-b1d1-f112430e684c/
Upload files to CGC:
Command to upload (an example)
$cgc-uploader/bin/cgc-uploader.sh -t xxxxxAuthenticationKeyxxxxxx --project xiaowen/fda-seqc2-wg-1 --tag bamSurgeon --manifest-file $ABSOLUTE/PATH/TO/file-manifest.tsv --manifest-metadata
Create a .meta file for each file to be uploaded (an example)
{
"sample_id": "bwa.tumorDesignate_IL_N_2",
"library_id": "bwa",
"platform": "illumina HiSeq"
}
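Writing one .meta file per upload by hand gets tedious for many files, so a small helper like the following could generate them. This is a sketch under assumptions: the function name is hypothetical, and placing the metadata in a file named after the data file with ".meta" appended is assumed here, so check it against the uploader's documentation before relying on it.

```python
import json
from pathlib import Path


def write_meta_file(data_path, metadata):
    """Write metadata as JSON to '<data_path>.meta' and return that path.

    The '<file name>.meta' naming is an assumption, not confirmed by these notes.
    """
    meta_path = Path(str(data_path) + ".meta")
    meta_path.write_text(json.dumps(metadata, indent=2) + "\n")
    return meta_path
```

It would be called once per file, for example with the metadata dictionary {"sample_id": "bwa.tumorDesignate_IL_N_2", "library_id": "bwa", "platform": "illumina HiSeq"} taken from the example above.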