BAM Simulation pipeline in SomaticSeq
BamSurgeon Pipeline
Detailed descriptions of the simulation pipeline can be found in the SomaticSeq git repository.
SomaticSeq SNV and INDEL classifiers were created for the FD, IL, NS, and NV data sets, each aligned with BWA MEM, NovoAlign, and Bowtie2, for a total of 12 SNV classifiers and 12 INDEL classifiers.
Create training data for 12 Data Groups
For the SEQC-II somatic mutation project, we took advantage of the fact that we have sequencing replicates of the normal genome from the four aforementioned sequencing sites. Hence, we can use machine learning to "learn" the error model within each sequencing center and obtain high-confidence calls.
Sequencing replicates aligned with the same aligner are considered a Data Group, e.g., NovaSeq @IL aligned by BWA MEM is one Data Group.
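The full set of Data Groups can be enumerated with a short shell loop (a sketch; the site and aligner labels simply follow the naming used in this document):

```shell
# 4 sequencing sites x 3 aligners = 12 Data Groups.
for site in FD IL NS NV; do
  for aligner in bwa bowtie2 novoalign; do
    echo "${site}.${aligner}"
  done
done
```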
Commands to create synthetic tumor-normal pairs
Example command for BWA-aligned data:
somaticseq/utilities/dockered_pipelines/bamSimulator/BamSimulator_multiThreads.sh \
--output-dir /ABSOLUTE/PATH/bamSims/bwa.FD1N_vs_FD2N \
--normal-bam-in /ABSOLUTE/PATH/WGS_FD_N_1.bwa.dedup.bam \
--tumor-bam-in /ABSOLUTE/PATH/WGS_FD_N_2.bwa.dedup.bam \
--normal-bam-out normalDesignate1N.bam \
--tumor-bam-out tumorDesignate2N.bam \
--genome-reference /ABSOLUTE/PATH/GRCh38.d1.vd1.fa \
--num-snvs 100000 \
--num-indels 20000 \
--num-svs 1000 \
--min-vaf 0 \
--max-vaf 1 \
--left-beta 2 \
--right-beta 5 \
--max-depth 1000 \
--seed 1981 \
--threads 36 \
--merge-output-bams --action echo --clean-bam
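The --min-vaf/--max-vaf and --left-beta/--right-beta options control the distribution of spiked-in variant allele fractions. Assuming the two beta values are the shape parameters of a Beta(2, 5) distribution over the [min-vaf, max-vaf] interval, the expected VAF is a/(a+b):

```shell
# Mean of Beta(a, b) is a / (a + b); with a = 2, b = 5 the expected VAF
# is 2/7 ~ 0.286, i.e., spiked-in mutations skew toward low allele fractions.
awk 'BEGIN { a = 2; b = 5; printf "%.3f\n", a / (a + b) }'
```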
Example command for Bowtie2-aligned data:
#!/bin/bash
somaticseq/utilities/dockered_pipelines/bamSimulator/BamSimulator_multiThreads.sh \
--output-dir /ABSOLUTE/PATH/bowtie.FD1N_vs_FD2N \
--normal-bam-in /ABSOLUTE/PATH/WGS_FD_N_1.bowtie.dedup.bam \
--tumor-bam-in /ABSOLUTE/PATH/WGS_FD_N_2.bowtie.dedup.bam \
--normal-bam-out normalDesignate1N.bam \
--tumor-bam-out tumorDesignate2N.bam \
--genome-reference /ABSOLUTE/PATH/GRCh38.d1.vd1.fa \
--num-snvs 100000 \
--num-indels 20000 \
--num-svs 0 \
--min-vaf 0 \
--max-vaf 1 \
--left-beta 2 \
--right-beta 5 \
--max-depth 1000 \
--seed 1981 \
--threads 36 \
--aligner 'bowtie2 --alignopts bowtie2ref:/mnt/ABSOLUTE/PATH/GRCh38.d1.vd1.fa' \
--merge-output-bams --action echo --clean-bam
Example command for NovoAlign-aligned data:
#!/bin/bash
somaticseq/utilities/dockered_pipelines/bamSimulator/BamSimulator_multiThreads.sh \
--output-dir /ABSOLUTE/PATH/novo.FD1N_vs_FD2N \
--normal-bam-in /ABSOLUTE/PATH/WGS_FD_N_1.novo.dedup.bam \
--tumor-bam-in /ABSOLUTE/PATH/WGS_FD_N_2.novo.dedup.bam \
--normal-bam-out normalDesignate1N.bam \
--tumor-bam-out tumorDesignate2N.bam \
--genome-reference /ABSOLUTE/PATH/GRCh38.d1.vd1.fa \
--num-snvs 100000 \
--num-indels 20000 \
--num-svs 1000 \
--min-vaf 0 \
--max-vaf 1 \
--left-beta 2 \
--right-beta 5 \
--max-depth 1000 \
--seed 1981 \
--threads 36 \
--aligner 'novoalign --alignopts novoref:/mnt/ABSOLUTE/PATH/GRCh38.d1.vd1.fa.ndx' \
--merge-output-bams --action echo --clean-bam
Create training data for SomaticSeq
For SEQC-II, two pairs of normal replicates from the same sequencing center and aligner went through the same process (e.g., N_2 vs. N_1 and N_3 vs. N_2) with different random seeds, so the spiked-in mutation positions differ between the two pairs.
The two pairs were combined to build the SomaticSeq classifiers we use.
The two pairs were also used to cross-validate each other, to ensure that the classifiers we create are not replicate-specific.
Synthetic Tumor-Normal for Combined NovaSeq Data Set
We combined 9 normal replicates and 9 tumor replicates into a 363X normal and a 378X tumor, for each of the three aligners (BWA, Bowtie2, and NovoAlign).
Since we no longer have replicates (we combined them all into a single super-replicate), we could not use the simulation strategy described above.
Instead, we created the following two training sets for each aligner:
Combine normal replicates #1 - #4 and treat that as a 159X normal. Combine normal replicates #5 - #9 as a 204X designated tumor, and then spike mutations into the latter.
Combine normal replicates #6 - #9 and treat that as a 162X normal. Combine normal replicates #1 - #5 as a 202X designated tumor, and then spike mutations into the latter.
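As a sketch, the replicate merging can be done with samtools merge. The commands below are printed as a dry run, and the BAM file names are hypothetical placeholders, not the actual SEQC-II file names:

```shell
# Dry run: print the samtools merge commands that would build the two
# combined BAMs for the first training set (file names are hypothetical).
normal_reps="N_1.bam N_2.bam N_3.bam N_4.bam"
tumor_reps="N_5.bam N_6.bam N_7.bam N_8.bam N_9.bam"
echo "samtools merge combined_normal_1to4.bam ${normal_reps}"
echo "samtools merge designated_tumor_5to9.bam ${tumor_reps}"
```

Remove the echo (and index the merged BAMs with samtools index) to actually run the merges.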
So the question is: how good are the SomaticSeq classifiers when the training data's sequencing coverage is half that of the actual data? To answer this question, we look at how well the original NovaSeq classifiers perform.
The original NovaSeq classifiers were created by combining the following two data sets:
Normal #2 as designated tumor (51X), with Normal #1 as the normal (43X)
Normal #1 as the designated tumor (43X), with Normal #2 as the normal (51X)
So those classifiers were created using data whose sequencing depth is about 1/4 that of the new training data sets. Using those classifiers to classify synthetic mutations in this training data set, the results are:
Accuracy from the Gold Set classifiers is not much different from the Gold Set's own cross-validation results, although it is not as good as cross-validation between the two new data sets, which is not surprising. The point is: if classifiers created from data sets with 1/4 the sequencing depth of the test data are this good (albeit not as good as if the sequencing depths were identical), then classifiers created from data sets with 1/2 the sequencing depth should be better.
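A quick arithmetic check of that depth ratio, using the figures quoted above (51X vs. 204X for the designated tumor, 43X vs. 159X for the normal):

```shell
# Both ratios come out to roughly 1/4.
awk 'BEGIN { printf "tumor ratio: %.2f, normal ratio: %.2f\n", 51/204, 43/159 }'
```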
Other than this difference, everything else was done identically for the Combined 9 NovaSeq data sets.
Synthetic Tumor-Normal for the 300X Data from GT
For those data sets, I split the normal BAM file into a 55-45 tumor-normal split with the following command:
somaticseq/utilities/dockered_pipelines/bamSimulator/BamSimulator_multiThreads.sh \
--output-dir /ABSOLUTE/PATH/bamSimulation_01 \
--tumor-bam-in /ABSOLUTE/PATH/GT.300X.Normal.bwa.dedup.bam \
--split-proportion 0.45 \
--normal-bam-out syntheticNormal.bam \
--tumor-bam-out syntheticTumor.bam \
--genome-reference /ABSOLUTE/PATH/GRCh38.d1.vd1.fa \
--num-snvs 110000 \
--num-indels 20000 \
--num-svs 1500 \
--min-vaf 0 \
--max-vaf 1 \
--left-beta 2 \
--right-beta 5 \
--max-depth 10000 \
--seed 2018 \
--threads 36 \
--split-bam --merge-output-bams
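Assuming --split-proportion 0.45 is the fraction of reads routed to the synthetic normal (consistent with the 55-45 tumor-normal split described above), the expected coverages of the two halves of a ~300X input work out to roughly:

```shell
# Expected depths after a 45/55 random read split of a ~300X BAM.
awk 'BEGIN { depth = 300; p = 0.45
             printf "normal: %.0fX, tumor: %.0fX\n", depth * p, depth * (1 - p) }'
```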
TNscope was not used on this data set, but everything else was identical.