BAM Simulation pipeline in SomaticSeq

BamSurgeon Pipeline

Create training data for 12 Data Groups

Commands to create synthetic tumor-normal pairs

Example command for BWA-aligned data:

somaticseq/utilities/dockered_pipelines/bamSimulator/BamSimulator_multiThreads.sh \
--output-dir       /ABSOLUTE/PATH/bamSims/bwa.FD1N_vs_FD2N \
--normal-bam-in    /ABSOLUTE/PATH/WGS_FD_N_1.bwa.dedup.bam \
--tumor-bam-in     /ABSOLUTE/PATH/WGS_FD_N_2.bwa.dedup.bam \
--normal-bam-out   normalDesignate1N.bam \
--tumor-bam-out    tumorDesignate2N.bam \
--genome-reference /ABSOLUTE/PATH/GRCh38.d1.vd1.fa \
--num-snvs         100000 \
--num-indels       20000 \
--num-svs          1000 \
--min-vaf          0 \
--max-vaf          1 \
--left-beta        2 \
--right-beta       5 \
--max-depth        1000 \
--seed             1981 \
--threads          36 \
--merge-output-bams --action echo --clean-bam


Example command for Bowtie2-aligned data:

#!/bin/bash

somaticseq/utilities/dockered_pipelines/bamSimulator/BamSimulator_multiThreads.sh \
--output-dir       /ABSOLUTE/PATH/bowtie.FD1N_vs_FD2N \
--normal-bam-in    /ABSOLUTE/PATH/WGS_FD_N_1.bowtie.dedup.bam \
--tumor-bam-in     /ABSOLUTE/PATH/WGS_FD_N_2.bowtie.dedup.bam \
--normal-bam-out   normalDesignate1N.bam \
--tumor-bam-out    tumorDesignate2N.bam \
--genome-reference /ABSOLUTE/PATH/GRCh38.d1.vd1.fa \
--num-snvs         100000 \
--num-indels       20000 \
--num-svs          0 \
--min-vaf          0 \
--max-vaf          1 \
--left-beta        2 \
--right-beta       5 \
--max-depth        1000 \
--seed             1981 \
--threads          36 \
--aligner          'bowtie2 --alignopts bowtie2ref:/mnt/ABSOLUTE/PATH/GRCh38.d1.vd1.fa' \
--merge-output-bams --action echo --clean-bam


Example command for NovoAlign-aligned data:

#!/bin/bash

somaticseq/utilities/dockered_pipelines/bamSimulator/BamSimulator_multiThreads.sh \
--output-dir       /ABSOLUTE/PATH/novo.FD1N_vs_FD2N \
--normal-bam-in    /ABSOLUTE/PATH/WGS_FD_N_1.novo.dedup.bam \
--tumor-bam-in     /ABSOLUTE/PATH/WGS_FD_N_2.novo.dedup.bam \
--normal-bam-out   normalDesignate1N.bam \
--tumor-bam-out    tumorDesignate2N.bam \
--genome-reference /ABSOLUTE/PATH/GRCh38.d1.vd1.fa \
--num-snvs         100000 \
--num-indels       20000 \
--num-svs          1000 \
--min-vaf          0 \
--max-vaf          1 \
--left-beta        2 \
--right-beta       5 \
--max-depth        1000 \
--seed             1981 \
--threads          36 \
--aligner          'novoalign --alignopts novoref:/mnt/ABSOLUTE/PATH/GRCh38.d1.vd1.fa.ndx' \
--merge-output-bams --action echo --clean-bam

Create training data for SomaticSeq

For SEQC-II, two pairs of normal replicates from the same sequencing center and aligner went through the same process (e.g., N_2 vs. N_1 and N_3 vs. N_2), each pair with a different random seed, so the spiked-in mutation positions differ between the two pairs.
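As a sketch, the two pairs come from invoking the same simulation script twice, changing only the input replicates and the `--seed` value. The file names and the second seed below are placeholders for illustration, not the original runs' values:

```shell
#!/bin/bash
# Sketch: generate two synthetic pairs with the same script, varying only
# the input replicates and the seed (file names and seeds are placeholders).
run_pair () {
    local normal_in=$1 tumor_in=$2 seed=$3
    # BamSimulator_multiThreads.sh --normal-bam-in "${normal_in}" \
    #     --tumor-bam-in "${tumor_in}" --seed "${seed}" ... (other flags as above)
    echo "would simulate ${tumor_in} vs ${normal_in} with --seed ${seed}"
}
run_pair WGS_FD_N_1.bwa.dedup.bam WGS_FD_N_2.bwa.dedup.bam 1981
run_pair WGS_FD_N_2.bwa.dedup.bam WGS_FD_N_3.bwa.dedup.bam 2981
```

Because the seeds differ, the two resulting truth sets do not share mutation positions, which keeps the training pairs independent of each other.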


Synthetic Tumor-Normal for Combined NovaSeq Data Set

We combined 9 normal replicates into a 363X normal and 9 tumor replicates into a 378X tumor for each of the three aligners (BWA, Bowtie2, and NovoAlign).
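The merging itself can be done with samtools merge (shown as a comment below; the file names are assumptions, not the original pipeline's exact invocation). As a quick sanity check on the totals, 363X over 9 normal replicates and 378X over 9 tumor replicates imply roughly 40X per replicate:

```shell
#!/bin/bash
# Merge sketch (placeholder file names):
#   samtools merge -@ 8 combined_normal.bwa.bam normal_rep_1.bwa.dedup.bam ... normal_rep_9.bwa.dedup.bam
#   samtools index combined_normal.bwa.bam
# Per-replicate depth implied by the combined totals:
normal_per_rep=$(awk 'BEGIN { printf "%.1f", 363/9 }')
tumor_per_rep=$(awk 'BEGIN { printf "%.1f", 378/9 }')
echo "per-replicate depth: normal ~${normal_per_rep}X, tumor ~${tumor_per_rep}X"
```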

Since we no longer have replicates (they were all combined into a single super-replicate), we could not use the simulation strategy described above.

Instead, we created the following two training sets for each aligner:

So the question is: how good are the SomaticSeq classifiers when the training data's sequencing coverage is half that of the actual data? To answer it, we will look at how well the original NovaSeq classifiers perform.

The original NovaSeq classifiers were created by combining the following two data sets:

So those classifiers were trained on data whose sequencing depth is about 1/4 that of the new training data set. Using them to classify the synthetic mutations in this training data set gives the following results:

(Figure: LargeClassifierCrossValidation)

Accuracies from the Gold Set classifiers are not much different from the Gold Set cross-validation results. However, they are not as good as cross-validation within the two new data sets, which is not surprising. The point is: if classifiers trained on data with 1/4 the sequencing depth of the test data perform this well (albeit not as well as when the depths are identical), then classifiers trained on data with 1/2 the sequencing depth should perform better.

Other than this difference, everything else was done identically for the Combined 9 NovaSeq data sets.

Synthetic Tumor-Normal for the 300X Data from GT

For these data sets, I split the normal BAM file 55-45 into tumor and normal with the following command:

somaticseq/utilities/dockered_pipelines/bamSimulator/BamSimulator_multiThreads.sh \
--output-dir       /ABSOLUTE/PATH/bamSimulation_01 \
--tumor-bam-in     /ABSOLUTE/PATH/GT.300X.Normal.bwa.dedup.bam \
--split-proportion 0.45 \
--normal-bam-out   syntheticNormal.bam \
--tumor-bam-out    syntheticTumor.bam \
--genome-reference /ABSOLUTE/PATH/GRCh38.d1.vd1.fa \
--num-snvs         110000 \
--num-indels       20000 \
--num-svs          1500 \
--min-vaf          0 \
--max-vaf          1 \
--left-beta        2 \
--right-beta       5 \
--max-depth        10000 \
--seed             2018 \
--threads          36 \
--split-bam --merge-output-bams
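Given the 55-45 tumor-normal split described above, we assume `--split-proportion 0.45` sends 45% of the reads to the synthetic normal and 55% to the synthetic tumor, so a ~300X input yields roughly a 135X normal and a 165X tumor. A quick check of that arithmetic:

```shell
#!/bin/bash
# Expected depths after splitting a ~300X BAM 55-45 into tumor and normal.
# Assumption: the 0.45 proportion is the normal share, per the text above.
input_depth=300
normal_depth=$(awk -v d="$input_depth" 'BEGIN { printf "%.0f", d*0.45 }')
tumor_depth=$(awk -v d="$input_depth" 'BEGIN { printf "%.0f", d*0.55 }')
echo "synthetic normal ~${normal_depth}X, synthetic tumor ~${tumor_depth}X"
```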

TNscope was not run on this data set; everything else was identical.