SomaticSeq - General process
We use SAMtools [30] and GATK HaplotypeCaller on the tumor and normal BAM files to obtain a number of independent sequencing features that are predictive of somatic mutation status, e.g., mapping quality, base call quality, strand bias, depth of coverage, tail distance bias, etc. Some caller-specific features, e.g., the somatic mutation score that each caller derives from its own statistics, are also included. For the DREAM Challenge and real data, we also consider whether the site is in dbSNP. Two of the most important features in the adaptively boosted classifiers are the root-mean-square mapping quality score and the number of read mismatches compared to the reference.
For the results described in this study, we have used P≥0.7 as the cut-off for our SomaticSeq results, i.e., a candidate site of P≥0.7 is considered a PASS call, whereas a candidate site of P<0.7 is considered LowQual.
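The cut-off above can be sketched as a small labelling step. This is only an illustration of the P≥0.7 rule as stated; the function name, the site keys, and the probabilities are made up for the example and are not part of SomaticSeq itself.

```python
# Minimal sketch of the P >= 0.7 rule described above.
# label_call and the example sites are hypothetical, not SomaticSeq's API.
PASS_THRESHOLD = 0.7  # cut-off stated in the text


def label_call(probability, threshold=PASS_THRESHOLD):
    """Return 'PASS' when P >= threshold, otherwise 'LowQual'."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return "PASS" if probability >= threshold else "LowQual"


# Hypothetical candidate sites with classifier probabilities.
candidates = {"chr1:12345": 0.93, "chr2:67890": 0.70, "chr3:13579": 0.42}
labels = {site: label_call(p) for site, p in candidates.items()}
```

Note that a site exactly at 0.7 is labelled PASS, because the rule is inclusive.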
Since eight of the top 18 features relate directly to sequencing depth, it is important for the trained model to have a sequencing depth comparable to that of the target set. Thus, it would not be appropriate to use a model trained on 30× whole-genome sequencing to predict somatic mutations in 500× targeted sequencing data.
From http://bioinform.github.io/somaticseq/data.html
Where ‘here’ is a dead link and ‘this’ refers to the dead link here: https://drive.google.com/drive/folders/0B9pfRlnkG-Z7STNNczk4ak5xSmM
2) Trained on Stage 2, testing on variants of Stage 3 data
A) Tumor has three different VAFs (50 %, 33 %, and 20 %) representing three different subclones.
B) Mixed the normal and tumor data at a 95:5 ratio
C) Mixed the tumor and normal data at a 70:30 ratio
D) Used the normal from Setting B and the tumor from Setting C
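As a rough aid for reading Settings A to D, the expected variant allele fraction after mixing can be worked out with simple arithmetic. The sketch below assumes that the stated ratios are read fractions, that both samples have the same depth, and that the normal sample carries no variant reads; none of this is confirmed by the notes above, and the function name is hypothetical.

```python
# Back-of-the-envelope sketch: expected VAF after mixing tumor and normal reads.
# Assumptions (not from the source): ratios are read fractions, depths are
# equal, and the normal sample contributes no variant-supporting reads.


def diluted_vaf(tumor_vaf, tumor_fraction):
    """Expected VAF when only `tumor_fraction` of the reads come from tumor."""
    if not (0.0 <= tumor_vaf <= 1.0 and 0.0 <= tumor_fraction <= 1.0):
        raise ValueError("both arguments must be between 0 and 1")
    return tumor_vaf * tumor_fraction


subclone_vafs = [0.50, 0.33, 0.20]  # the three subclones in Setting A

# Setting C: tumor mixed with normal at 70:30, so 70 % of reads are tumor.
setting_c_tumor = [diluted_vaf(v, 0.70) for v in subclone_vafs]

# Setting B: normal mixed with tumor at 95:5, so 5 % of reads are tumor.
setting_b_normal = [diluted_vaf(v, 0.05) for v in subclone_vafs]
```

Under these assumptions the 50 % subclone would appear at roughly 35 % in the Setting C tumor and at roughly 2.5 % in the Setting B normal.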
Specific examinations
1) Train on Stage 2, test on straight Stage 3
Mixed Stage 2 tumor/normal data at a 70:30 ratio for training; the test data was Stage 3.
Results were averaged over ten rounds of twofold cross-validation. In each round, the training set was a randomly chosen half of the entire data set.
3) In Silico Titration
4) SomaticSpike
5) COLO-829, CLL1 trained on Stage 3 data
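The repeated twofold cross-validation mentioned under item 1 can be sketched as follows. This is a generic illustration using only the standard library, not the authors' code: the function names, the seed, and the choice to average over every fold of every repeat are assumptions, and `evaluate` stands in for whatever train-and-score routine is used.

```python
# Sketch of ten repeats of twofold cross-validation (assumed aggregation:
# mean over all 2 x 10 fold evaluations). Names here are hypothetical.
import random
from statistics import mean


def repeated_twofold_splits(n_items, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) pairs, two per repeat."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)           # random halves for this repeat
        half = n_items // 2
        first, second = shuffled[:half], shuffled[half:]
        yield first, second             # train on first half, test on second
        yield second, first             # then swap the roles


def averaged_score(n_items, evaluate, n_repeats=10, seed=0):
    """Average evaluate(train, test) over every fold of every repeat."""
    return mean(
        evaluate(train, test)
        for train, test in repeated_twofold_splits(n_items, n_repeats, seed)
    )
```

In use, `evaluate` would train a classifier on the `train` indices and return a metric such as F1 on the `test` indices.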
Mutect)
Used dbSNP v.138, COSMIC v.69, and a Panel of Normals based on Phase 1 of the 1kGP as resource files for the real sequencing data. COSMIC was not supplied for the DREAM Challenge, because the synthetic mutations were randomly chosen and not enriched in COSMIC sites. In the in silico titration and SomaticSpike experiments, none of these databases was used.
SomaticSniper) mapping quality cut-off 25, base quality cut-off 15, prior somatic mutation probability 10^-4
VarScan2) mapping quality cut-off 25, base quality cut-off of 20.
JointSNVMix2) convergence threshold of 0.01 in training, somatic probability ≥0.95
VarDict) relaxed the variant depth filter from 4 to 2, and the FET p-value cut-off from 0.05 to 0.15. Allowed each call to fail up to two of the 20 VarDict filters.
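The "fail up to two filters" tolerance for VarDict could be applied to a VCF FILTER column roughly as below. This is a sketch of the idea only: it relies on the VCF convention that FILTER is either `PASS`, `.`, or a semicolon-separated list of failed filter names; the function names and the filter names in the example are placeholders, not VarDict's actual filter identifiers.

```python
# Sketch: keep a call if it fails at most `max_failed` filters.
# Relies on the VCF convention for the FILTER column; filter names used
# in the example below are placeholders, not real VarDict filter IDs.


def failed_filters(filter_field):
    """Return the list of failed filter names from a VCF FILTER value."""
    if filter_field in ("PASS", ".", ""):
        return []                       # nothing failed, or filters not applied
    return [name for name in filter_field.split(";") if name]


def keep_call(filter_field, max_failed=2):
    """True when the call fails no more than `max_failed` filters."""
    return len(failed_filters(filter_field)) <= max_failed


examples = ["PASS", "filterA", "filterA;filterB", "filterA;filterB;filterC"]
kept = [f for f in examples if keep_call(f)]
```

With the default tolerance of two, only the last example (three failed filters) is dropped.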
Installation:
Looks difficult to work with. Testimony:
Bcbio inclusion in SomaticSeq to exploit its decision tree seems difficult. There is a claim of a "more or less working" version of SomaticSeq that can include bcbio vcfs in its results. The dev asked Chapman, 10 days before the last commit to the modified SomaticSeq:
Chapman et al reported previously evaluating SomaticSeq for inclusion in bcbio and decided against, stating:
Existing bcbio variant calling pipelines
Example study comparing combinations of assembly and variant calling pipelines, with explicit shell commands
Trio pipeline example with explicit shell commands
bcbio guidance on installing new variant callers. more guidance
advice for adding variant callers to bcbio
bcbio authors summarizing their attempts at incorporating popular callers
bcbio discussion on including MuSE (Feb 2017)
Varscan is under maintenance (Mar2017)
Germline variant callers include GATK UnifiedGenotyper [15], GATK HaplotypeCaller [15], FreeBayes [16], SAMtools mpileup/bcftools [17], Isaac Variant Caller (IVC) [18] and Platypus [19]. Somatic variant callers include MuTect [20], Shimmer [21], SomaticSniper [22], Strelka [23], VarScan2 [24] and Virmid [25].
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0135800
Lumpy can integrate multiple signals. If all of our callers have continuous variable confidence scores they should fit.
from Wellcome Trust, Adams
Integrates four publicly available somatic variant-calling algorithms (Bambino, CaVEMan, SAMtools mpileup, and VarScan 2) with extra filtering to identify single nucleotide variants. Merge, consensus, filter model.