Building workflow on Seven Bridges Genomics

Progress update on workflow building on Seven Bridges Genomics:

AWS EC2 instance types: https://aws.amazon.com/ec2/instance-types/

MuTect2 (x):

Grab SM from BAMs: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/sm-extractor
Mutect2 App: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/mutect2-1
FilterMutectCalls App: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/filtermutectcalls-1
MuTect2 Workflow: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/mutect2-workflow-1
- Successful test run: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/a3081eb2-5de0-4ae6-a935-a5967e0ffc2b/
Parallelized mutect2-workflow: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/parallel-mutect2-workflow
- Test run completed successfully but parallelization was not as efficient as intended: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/77d3fde9-bc6e-4b12-86fd-2f0cfb1da78d/
- Inefficient parallelization was due to resource request/limitation on the node. This one parallelized successfully and correctly: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/9411cf88-19fc-4393-9fb3-0862b684076a/
- Scatter-and-gather successfully completed: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/717e1271-2eb6-499d-8c92-78865a3dd401/
- Successful test run scatter-gather-sort: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/490ace9a-6294-485c-aaa7-9653d1ba0849

SomaticSniper (x):

SomaticSniper app: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/somaticsniper
- Successful test run: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/07a83384-5259-4950-9e96-fb65976e1244/
SomaticSniper workflow (splits the results by region, based on the input BED file): https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/somaticsniper-workflow
SomaticSniper run split, i.e., run SomaticSniper on a single thread, but then use BEDTools to split the result into region-wise VCF files, to be consistent (region-wise) with other tools: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/somaticsniper-run-split
- Successful test run: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/8abe554f-61c5-4d91-b467-1b093c1d93f0/
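The app itself uses BEDTools for the region-wise split. As a rough illustration of the same idea in plain Python, the sketch below assigns each VCF record to every region set that contains its position. The function name `split_vcf_by_regions` and the in-memory data layout are assumptions made for this example, not part of the app.

```python
def split_vcf_by_regions(vcf_lines, region_sets):
    """Split VCF lines into one list per region set.

    region_sets maps an output name to a list of (chrom, start, end)
    tuples in BED coordinates (0-based, half-open). Header lines are
    copied into every output.
    """
    header = [line for line in vcf_lines if line.startswith("#")]
    outputs = {name: list(header) for name in region_sets}
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        chrom, pos = line.split("\t")[:2]
        pos0 = int(pos) - 1  # VCF POS is 1-based; BED start is 0-based
        for name, regions in region_sets.items():
            if any(c == chrom and s <= pos0 < e for c, s, e in regions):
                outputs[name].append(line)
    return outputs


if __name__ == "__main__":
    vcf = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT",
        "chr1\t100\t.\tA\tT",
        "chr1\t5001\t.\tG\tC",
    ]
    regions = {"1.bed": [("chr1", 0, 5000)], "2.bed": [("chr1", 5000, 10000)]}
    for name, lines in split_vcf_by_regions(vcf, regions).items():
        print(name, len(lines) - 2, "record(s)")
```

A linear scan over regions is fine for a handful of intervals; for large BED files an interval index (or BEDTools, as in the app) is the better choice.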

VarDict (x):

App to split a BED file into 5000-bp regions, one per line: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/bed-splitter
VarDictJava: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/vardictjava
VarDict's testsomatic: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/vardict-testsomatic
VarDict's var2vcf: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/vardict-var2vcf
VarDict Workflow: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/vardict-workflow
- Successfully tested on local rabix-executor, but had trouble on the cloud.
- Successful test run on the cloud: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/e217516a-5d18-440c-8f56-17306b000df2/
  - Set the tumor_bam and normal_bam inputs to positions 99 and 100 so that they no longer appear on the command line. Later also set '' as the value for those inputs, which will hopefully fix it for the foreseeable future.
parallel-vardict-workflow: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/parallel-vardict-workflow
- test run: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/7728ef60-0b11-46ee-85c6-23d130877228
- Successful test of scatter-gather-sort: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/5f2f5b3e-87f7-42bc-90b8-6ff0bef77f7a
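For the BED-splitting app listed at the top of this section, a minimal sketch of the chunking step might look like the following. It assumes that "5000 bp per line" means every output line spans at most 5000 bp; the function name `split_bed_line` is made up for this example and is not taken from the app.

```python
def split_bed_line(chrom, start, end, max_len=5000):
    """Yield (chrom, start, end) chunks of at most max_len bases each."""
    pos = start
    while pos < end:
        chunk_end = min(pos + max_len, end)
        yield chrom, pos, chunk_end
        pos = chunk_end


if __name__ == "__main__":
    # A 12,000-bp region becomes two full chunks plus a shorter remainder.
    for chrom, start, end in split_bed_line("chr1", 0, 12000):
        print(f"{chrom}\t{start}\t{end}")
```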

MuSE (x):

Keep first 3 columns of bed file: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/trim-bed-to-3-columns
MuSE call command: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/muse-call-1
MuSE sump command: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/muse-sump-1
- Introduced a hack to copy dbsnp.vcf.gz and its index, and then touch the index so that it has a later timestamp than the VCF.
- https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/bd46cf45-1089-4951-9402-f8de8a3ad00b/
MuSE workflow: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/muse-workflow-1
- Can execute successfully on local rabix-executor, but on the cloud MuSE is complaining about the dbsnp.vcf.gz's index, e.g.,
  - https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/b04b9489-2cd0-4d0f-8820-d0bb6b251265/
  - https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/ffe97765-f582-4355-a5bb-caa2b2e9b83a/. This parallelization seems to work properly, though the job failed for the same dbsnp.vcf.gz index issues.
  - To get around MuSE complaining about vcf.gz.idx being older than vcf.gz, I made muse-sump copy/link and touch the file. The workflow hack that succeeded: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/7fc84cc4-0bad-4b42-aace-17590cf842e0/
  - Successful test of scatter-gather-sort: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/e758b599-662b-4e4f-9f0b-b5cf091bba1f
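The copy-and-touch workaround for the dbsnp index can be sketched as below. This is an illustration only: the function name `copy_with_fresh_index` and the destination directory are assumptions, and the actual app may link rather than copy.

```python
import os
import shutil
import time


def copy_with_fresh_index(vcf_gz, index, dest_dir):
    """Copy a VCF and its index, then make the index newer than the VCF."""
    os.makedirs(dest_dir, exist_ok=True)
    new_vcf = shutil.copy(vcf_gz, dest_dir)  # returns the destination path
    new_idx = shutil.copy(index, dest_dir)
    vcf_mtime = os.stat(new_vcf).st_mtime
    # Equivalent of `touch`, but guaranteed to land after the VCF's mtime.
    newer = max(time.time(), vcf_mtime + 1)
    os.utime(new_idx, (newer, newer))
    return new_vcf, new_idx
```

Setting the timestamp explicitly, rather than relying on the copy order, avoids ties on file systems with coarse timestamp resolution.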

Scalpel (x):

Complete Scalpel workflow: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/scalpel-1
- Successful run: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/f1d61dac-bcb5-4385-a007-7f4ae6f9e018/
parallel-scalpel-workflow:
- Successful test run: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/3cc727c1-a8ca-445f-9c27-6472dcaf4862/

Strelka (x):

strelka-config and strelka-run in one go: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/strelka-config-1
- Successful Strelka run: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/3c3b7f61-665f-4add-9f37-bea0c6241c56/
- Not yet tested: automatically bgzip and tabix-index the input BED file. Shouldn't be too difficult.
strelka-workflow (takes in a BED file and makes bed.gz): https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/strelka-workflow
- Successful test run: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/4ebd373a-fb0a-405f-a0e2-ae41e834a5b4/
parallel-strelka-workflow, which allows parallelization by region as well as through the -j N parameter. This is more flexible because it does not require a node with more cores than the number of threads specified: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/parallel-strelka-workflow
- Successful test run: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/685fbb0d-aad4-410e-9aa0-2ce26ed3f816/
- Results identical to the non-region-parallelized strelka job.
- Successful test run of scatter-gather-sort: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/7983f49d-b75c-4f60-a653-5c2c68a0e897

Sentieon TNscope:

Copied workflow:
- Run submitted: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/07f74e4c-7a3b-4614-9629-f76490e585d0/stats/

SomaticSeq.Wrapper.sh (x):

somaticseq-wrapper: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/somaticseq-wrapper-1
- Successful runs:
  - https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/69487785-bf93-4f74-9377-4ffae95a4f3c/
  - https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/3cd74553-e78d-4541-b810-f8b7f92b9513/

BED-parallelizer (x):

Split a BED file into X number of bed files of equal region size: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/bed-parallelizer
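One way to split regions into X parts of roughly equal total size is sketched below. It is an assumption about the approach, not the app's actual code: the function name `split_regions_evenly` is hypothetical, and regions are cut wherever a part reaches its share of the total bases.

```python
import math


def split_regions_evenly(regions, n):
    """Split (chrom, start, end) regions into n lists of about equal total length."""
    total = sum(end - start for _, start, end in regions)
    target = math.ceil(total / n)  # bases each part should hold
    chunks = [[] for _ in range(n)]
    i, room = 0, target
    for chrom, start, end in regions:
        while start < end:
            take = min(room, end - start)
            chunks[i].append((chrom, start, start + take))
            start += take
            room -= take
            if room == 0 and i < n - 1:
                i += 1  # current part is full; move on to the next one
                room = target
    return chunks


if __name__ == "__main__":
    parts = split_regions_evenly([("chr1", 0, 100), ("chr2", 0, 50)], 3)
    for number, part in enumerate(parts, start=1):
        print(f"{number}.bed", part)
```

Because the target is rounded up, the last part can be slightly smaller than the others, and parts may be empty when X exceeds the total number of bases.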

App to vcf-concat input VCF files in the order given by sort -V of their file names:

app: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/vcf-concat-sort
- Tested with this parallel-mutect2-workflow run: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/ac9e97f8-1f8e-43df-ac53-8d57b05d7a9a
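The reason for sort -V is that a plain lexical sort puts 10.vcf before 2.vcf, which would concatenate regions in the wrong order. The sketch below approximates that ordering in Python; the key function `version_key` is written for this example and does not reproduce every rule of GNU sort -V.

```python
import re


def version_key(name):
    """Sort key that compares runs of digits numerically, not as text."""
    tokens = [tok for tok in re.split(r"(\d+)", name) if tok]
    # Tag each token so numbers and text never get compared directly.
    return [(0, int(tok)) if tok.isdigit() else (1, tok) for tok in tokens]


if __name__ == "__main__":
    names = ["10.vcf", "2.vcf", "1.vcf"]
    print(sorted(names))                   # lexical order
    print(sorted(names, key=version_key))  # numeric-aware order
```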

SomaticSeq workflow with MuTect2(M), SomaticSniper(S), and Strelka(K):

somaticseq-workflow-MSK: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/somaticseq-workflow-msk
Jobs are failing so far, but parallelization seems to be working: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/1fffae52-c6a5-4781-906d-49744a6c765c
The workflow does not seem to be able to pass each tool's array of VCF files to somaticseq-wrapper.

Multiple caller workflow with MuTect2(M), SomaticSniper(S), VarDict(D), and Strelka(K):

somaticCallers-MSDK: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/somaticcallers-msdk
Use bed-parallelizer to split input BED into multiple equal-sized BED files, then use those to parallelize.
At the end, use vcf-concat-sort to combine the VCF files from each region: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/apps/#xiaowen/fda-seqc2-wg-1/vcf-concat-sort
Successful test runs: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/284b31c7-519e-4c59-8ee0-67c7122166a7
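On the platform, the scatter and gather steps are handled by the workflow itself. Purely to illustrate the pattern, the sketch below runs a caller over each region file in parallel and gathers the results in input order. `scatter_gather` and `call_region` are hypothetical names; `call_region` stands in for whichever caller is being run.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter_gather(region_files, call_region, max_workers=4):
    """Run call_region on each region file in parallel and merge the results.

    call_region takes one region file name and returns a list of records.
    Results are concatenated in the order of region_files, because
    Executor.map yields results in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_region = list(pool.map(call_region, region_files))
    return [record for records in per_region for record in records]


if __name__ == "__main__":
    def fake_caller(bed):
        # Placeholder for a real variant caller invocation.
        return [f"{bed}:variant"]

    print(scatter_gather(["1.bed", "2.bed", "10.bed"], fake_caller))
```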

Running GATK CallableLoci:

Successful test job: https://cgc.sbgenomics.com/u/xiaowen/fda-seqc2-wg-1/tasks/dc205640-cfb8-4cd0-b1d1-f112430e684c/

Upload files to CGC:

Command to upload (an example)

$cgc-uploader/bin/cgc-uploader.sh -t xxxxxAuthenticationKeyxxxxxx --project xiaowen/fda-seqc2-wg-1 --tag bamSurgeon --manifest-file $ABSOLUTE/PATH/TO/file-manifest.tsv --manifest-metadata

Create a .meta file for each file to be uploaded (an example)

{
  "sample_id": "bwa.tumorDesignate_IL_N_2",
  "library_id": "bwa",
  "platform": "illumina HiSeq"
}
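When many files need metadata, the .meta files can be generated rather than written by hand. The sketch below assumes that each .meta file sits next to its data file and is named by appending ".meta" to the data file name; that naming rule and the helper `write_meta` are assumptions to check against the uploader's documentation.

```python
import json
from pathlib import Path


def write_meta(data_file, metadata):
    """Write metadata as JSON to '<data_file>.meta' and return that path."""
    meta_path = Path(str(data_file) + ".meta")
    meta_path.write_text(json.dumps(metadata, indent=2) + "\n")
    return meta_path


if __name__ == "__main__":
    # Hypothetical BAM name; the metadata values mirror the example above.
    path = write_meta(
        "example.bam",
        {
            "sample_id": "bwa.tumorDesignate_IL_N_2",
            "library_id": "bwa",
            "platform": "illumina HiSeq",
        },
    )
    print(path.read_text())
```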
