Use SomaticSeq to run Dockerized Mutation Callers
Automated mutation caller and SomaticSeq run scripts
We have created automated run scripts for MuTect2, VarScan2, JointSNVMix2, SomaticSniper, VarDictJava, MuSE, LoFreq, Scalpel, and Strelka. They are in the SomaticSeq repo: https://github.com/bioinform/somaticseq/.
Once you have cloned somaticseq from GitHub, the mutation caller run scripts can be created (and submitted if desired) by running utilities/dockered_pipelines/submit_callers_singleThread.sh or utilities/dockered_pipelines/submit_callers_multiThreads.sh. Documentation for those pipelines, as well as the source code, is here: https://github.com/bioinform/somaticseq/tree/master/utilities/dockered_pipelines
Singularity:
We have also ported the dockerized pipeline to Singularity. The commands are the same as described below, except that the "dockered_pipelines" directory is replaced with "singularities" in the repository. The Singularity pipeline actually pulls the Docker images from the Docker registry.
Required:
An internet connection and a running Docker (or Singularity) service, in order to pull and run Docker images from Docker Hub.
A computing cluster management system, such as Sun Grid Engine (SGE), that provides the "qsub" command.
When specifying $PATH/TO/dbSNP.vcf, dbSNP.vcf.gz and dbSNP.vcf.gz.tbi must also be present in the same directory, because MuSE and LoFreq expect indexed, bgzipped VCF files. If you do not run MuSE or LoFreq, those extra files are not necessary.
The reference .fa file (which must have a .fa or .fasta extension) must be accompanied by the .dict and .fa.fai (or .fasta.fai) files in the same directory. The .dict file is required for VCF sorting in Scalpel. The .fa.fai file is also required for genome splitting in the multiThreads scripts (alternatively, supply a BED file).
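The companion-file requirements above can be sanity-checked before submitting any jobs. A minimal sketch (the helper functions and the demonstration file names below are illustrative, not part of the pipeline):

```shell
#!/bin/bash
# Sanity-check companion files for the reference FASTA and the dbSNP VCF
# before submitting caller jobs. Helper names and demo paths are illustrative.
set -eu

check_reference() {
    local fa=$1                       # e.g. GRCh38.d1.vd1.fa
    local base=${fa%.*}               # strip the .fa/.fasta extension
    [ -f "$fa" ]        || { echo "missing: $fa"; return 1; }
    [ -f "$fa.fai" ]    || { echo "missing: $fa.fai (samtools faidx)"; return 1; }
    [ -f "$base.dict" ] || { echo "missing: $base.dict (Picard CreateSequenceDictionary)"; return 1; }
    echo "reference OK"
}

check_dbsnp() {
    local vcf=$1                      # e.g. dbsnp_146.hg38.vcf
    [ -f "$vcf" ]        || { echo "missing: $vcf"; return 1; }
    # The next two are only needed if MuSE or LoFreq will be run:
    [ -f "$vcf.gz" ]     || { echo "missing: $vcf.gz (bgzip)"; return 1; }
    [ -f "$vcf.gz.tbi" ] || { echo "missing: $vcf.gz.tbi (tabix -p vcf)"; return 1; }
    echo "dbSNP OK"
}

# Demonstration with dummy files in a temporary directory:
tmp=$(mktemp -d)
touch "$tmp/ref.fa" "$tmp/ref.fa.fai" "$tmp/ref.dict"
touch "$tmp/dbSNP.vcf" "$tmp/dbSNP.vcf.gz" "$tmp/dbSNP.vcf.gz.tbi"
check_reference "$tmp/ref.fa"
check_dbsnp "$tmp/dbSNP.vcf"
rm -r "$tmp"
```

Running such a check up front fails fast, instead of discovering a missing .tbi only after MuSE or LoFreq jobs have been queued.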
Build SomaticSeq classifiers for synthetic tumor-normal pairs
For the SEQC2 somatic mutation projects, SomaticSeq and mutation calling tasks were primarily done on CGC. However, for the synthetic tumor-normal pairs described here (figure below), we have largely run them locally on Roche computing resources.
Example Command
The command to run five somatic mutation callers on the pseudo-tumor vs. normal pair described in the figure above:
#!/bin/bash
/home/fangl10/apps/somaticseq-2.7.2/utilities/dockered_pipelines/submit_callers_multiThreads.sh \
--normal-bam /sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/BAMs/WGS_IL_N_1.bwa.dedup.bam \
--tumor-bam /sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/tumorDesignate2N.bam \
--human-reference /sc1/groups/bfx-red/data/datainsights/SEQC2_Resources/GRCh38.d1.vd1.fa \
--dbsnp /sc1/groups/bfx-red/data/datainsights/SEQC2_Resources/dbSNP/dbsnp_146.hg38.vcf \
--threads 24 \
--truth-indel /sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/synthetic_indels.leftAlign.vcf \
--truth-snv /sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/synthetic_snvs.vcf \
--output-dir /sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations \
--somaticseq-dir SomaticSeq.MSDUK.v2.7.2 \
--mutect2 --somaticsniper --vardict --muse --strelka --somaticseq
All those commands are stored here for archival purposes.
TNscope, however, is still run on CGC.
The following script combines all of those VCF outputs, and then incorporates the ground truth to create the truth-annotated TSV files:
#!/bin/bash
#$ -o /sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/SomaticSeq.MSDUKT.v2.7.2/logs
#$ -e /sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/SomaticSeq.MSDUKT.v2.7.2/logs
#$ -S /bin/bash
#$ -l h_vmem=10G
set -e
docker pull lethalfang/somaticseq:2.7.2
echo -e "Start at `date +"%Y/%m/%d %H:%M:%S"`" 1>&2
docker run --rm -v /:/mnt -u 83748 lethalfang/somaticseq:2.7.2 \
/opt/somaticseq/SomaticSeq.Wrapper.sh \
--output-dir /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/SomaticSeq.MSDUKT.v2.7.2 \
--genome-reference /mnt//sc1/groups/bfx-red/data/datainsights/SEQC2_Resources/GRCh38.d1.vd1.fa \
--tumor-bam /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/tumorDesignate2N.bam \
--normal-bam /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/BAMs/WGS_IL_N_1.bwa.dedup.bam \
--mutect2 /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/MuTect2.vcf \
--sniper /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/SomaticSniper.vcf \
--vardict /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/VarDict.vcf \
--muse /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/MuSE.vcf \
--tnscope /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/bwa.dedup-tumorDesignate.IL_N_2_vs_IL_N_1-TNScope.201711.02.vcf.gz \
--strelka-snv /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/Strelka/results/variants/somatic.snvs.vcf.gz \
--strelka-indel /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/Strelka/results/variants/somatic.indels.vcf.gz \
--inclusion-region /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/1/1.bed \
--dbsnp /mnt//sc1/groups/bfx-red/data/datainsights/SEQC2_Resources/dbSNP/dbsnp_146.hg38.vcf \
--truth-snv /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/synthetic_snvs.vcf \
--truth-indel /mnt//sc1/groups/bfx-red/analysis/datainsights/projects/SEQC2/wg1/bamSims/bwa.IL1N_vs_IL2N/somaticMutations/../synthetic_indels.leftAlign.vcf
echo -e "Done at `date +"%Y/%m/%d %H:%M:%S"`" 1>&2
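In the script above, the container bind-mounts the host root at /mnt (docker run -v /:/mnt), so every host path is prefixed with /mnt; that is what produces the /mnt//sc1/... double slashes, which are harmless. A hypothetical helper illustrating the mapping:

```shell
#!/bin/bash
# Map an absolute host path to its in-container location, assuming the host
# root is bind-mounted at /mnt (docker run -v /:/mnt). This helper is
# illustrative; the archived scripts simply prepend /mnt to each path,
# which is why /mnt//sc1/... contains a double slash (harmless on Linux).
to_container_path() {
    printf '/mnt%s\n' "$1"
}

to_container_path /sc1/groups/bfx-red/data/datainsights/SEQC2_Resources/GRCh38.d1.vd1.fa
# → /mnt/sc1/groups/bfx-red/data/datainsights/SEQC2_Resources/GRCh38.d1.vd1.fa
```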
All of those scripts are stored here for archival purposes. They run region by region, which means their outputs will need to be combined. Within each Data Group, the outputs from the two synthetic pairs are combined to train one SNV classifier and one INDEL classifier per Data Group.
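Combining the per-region truth-annotated TSV files amounts to concatenating them while keeping the header line only once. A minimal sketch; the per-region directory layout (1/, 2/, ...) and file names below are assumptions for illustration:

```shell
#!/bin/bash
# Combine per-region SomaticSeq TSV outputs into one file, keeping the
# header line only from the first region. File names and the per-region
# layout (1/, 2/, ...) are illustrative, not the exact archived layout.
set -eu

combine_tsv() {
    local out=$1; shift
    head -n 1 "$1" > "$out"           # header from the first region only
    for f in "$@"; do
        tail -n +2 "$f" >> "$out"     # data rows from every region
    done
}

# Demonstration with dummy per-region TSVs:
tmp=$(mktemp -d)
printf 'CHROM\tPOS\tTrueVariant_or_False\n' >  "$tmp/r1.tsv"
printf 'chr1\t100\t1\n'                     >> "$tmp/r1.tsv"
printf 'CHROM\tPOS\tTrueVariant_or_False\n' >  "$tmp/r2.tsv"
printf 'chr2\t200\t0\n'                     >> "$tmp/r2.tsv"
combine_tsv "$tmp/all.tsv" "$tmp/r1.tsv" "$tmp/r2.tsv"
wc -l < "$tmp/all.tsv"                # one header plus two data rows
rm -r "$tmp"
```

The same pattern applies when pooling the TSVs from two synthetic pairs before training a classifier, as long as the files share an identical header.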