Use SomaticSeq to run Dockerized Mutation Callers

Automated mutation caller and SomaticSeq run scripts

We have created automated run scripts for MuTect2, VarScan2, JointSNVMix2, SomaticSniper, VarDictJava, MuSE, LoFreq, Scalpel, and Strelkas. They are in the SomaticSeq repo: https://github.com/bioinform/somaticseq/

Once you cloned somaticseq from github, the mutation caller run scripts can be created (and submitted if desired) by running utilities/dockered_pipelines/submit_callers_singleThread.sh or utilities/dockered_pipelines/submit_callers_multiThreads.sh commands. Documentations for those pipelines as well as the source codes are  here: https://github.com/bioinform/somaticseq/tree/master/utilities/dockered_pipelines

Singularities:

We have also ported the dockerized pipeline to singularities. They'll have the same commands as described below, but replace "dockered_pipelines" locations with "singularities," i.e., here. The singularity pipeline actually pulls from docker images from the docker registry.

Required:


Build SomaticSeq classifiers for synthetic tumor-normal pairs

For SEQC2 somatic mutation projects, SomaticSeq and mutation calling tasks were primarily done on CGC. However, we have largely run them locally on Roche computing with synthetic tumor-normal pairs described here (figure below). 

Example Command

The command to run 5 somatic mutation callers on the pseudo-tumor vs. normal as described in the figure above:

#!/bin/bash


/home/fangl10/apps/somaticseq-2.7.2/utilities/dockered_pipelines/submit_callers_multiThreads.sh \

--normal-bam /sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/BAMs/WGS_IL_N_1.bwa.dedup.bam \

--tumor-bam /sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/tumorDesignate2N.bam \

--human-reference /sc1/groups/bfx-red/data/datainsights/SEQC2_Resources/GRCh38.d1.vd1.fa \

--dbsnp /sc1/groups/bfx-red/data/datainsights/SEQC2_Resources/dbSNP/dbsnp_146.hg38.vcf \

--threads 24 \

--truth-indel /sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/synthetic_indels.leftAlign.vcf  \

--truth-snv /sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/synthetic_snvs.vcf \

--output-dir /sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations \

--somaticseq-dir SomaticSeq.MSDUK.v2.7.2 \

--mutect2 --somaticsniper --vardict --muse --strelka --somaticseq

All those commands are stored here for archival purposes. 

TNscope, however, is still run on CGC

The script to combine all those VCF outputs, and then combine ground truth to create the truth-annotated TSV files

#!/bin/bash


#$ -o /sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/SomaticSeq.MSDUKT.v2.7.2/logs

#$ -e /sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/SomaticSeq.MSDUKT.v2.7.2/logs

#$ -S /bin/bash

#$ -l h_vmem=10G

set -e


docker pull lethalfang/somaticseq:2.7.2


echo -e "Start at `date +"%Y/%m/%d %H:%M:%S"`" 1>&2


docker run --rm -v /:/mnt -u 83748 lethalfang/somaticseq:2.7.2 \

/opt/somaticseq/SomaticSeq.Wrapper.sh \

--output-dir       /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/SomaticSeq.MSDUKT.v2.7.2 \

--genome-reference /mnt//sc1/groups/bfx-red/data/datainsights/SEQC2_Resources/GRCh38.d1.vd1.fa \

--tumor-bam        /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/tumorDesignate2N.bam \

--normal-bam       /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/BAMs/WGS_IL_N_1.bwa.dedup.bam \

--mutect2          /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/MuTect2.vcf \

--sniper           /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/SomaticSniper.vcf \

--vardict          /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/VarDict.vcf \

--muse             /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/MuSE.vcf \

--tnscope          /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/bwa.dedup-tumorDesignate.IL_N_2_vs_IL_N_1-TNScope.201711.02.vcf.gz \

--strelka-snv      /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/Strelka/results/variants/somatic.snvs.vcf.gz \

--strelka-indel    /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/Strelka/results/variants/somatic.indels.vcf.gz \

--inclusion-region /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/1.bed \

--dbsnp            /mnt//sc1/groups/bfx-red/data/datainsights/SEQC2_Resources/dbSNP/dbsnp_146.hg38.vcf \

--truth-snv        /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/synthetic_snvs.vcf \

--truth-indel      /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/synthetic_indels.leftAlign.vcf \


echo -e "Done at `date +"%Y/%m/%d %H:%M:%S"`" 1>&2

All those scripts are stored here for archival purposes. They are region-by-region, which means they will need to be combined. Two synthetic pairs are combined in each Data Group to make a SNV and a INDEL classifier for each Data Group.