Orthogonal Validations
For validation "calls," we did not implement any mutation callers.
Mutation callers were used on the data sets that created the "gold set," so validation procedure calls for a different prior. Thus, simple rules were followed that reflect the following criteria:
Validation Positive
Variant signal clearly exists in the tumor, and variant signal does not exist in the matched normal despite high degree of signal (i.e., good sequencing depth in the normal with no more variant reads than expected from sequencing error).
Validation Negative (two ways to be negatively validated)
There is plenty of sequencing depth in the tumor, but the number of variant reads is no more than expected from sequencing errors, or
There is clear variant signal in the matched normal indicating germline false positive
Uninterpretable
Not enough depth in tumor and/or normal to make a call for positive or negative validation.
There appears to be some variant signal in tumor, but not that much higher than expected from error, so we cannot be sure.
There appears to be some variant signal in normal, but we aren't sure if it's definitely germline signal or sequencing errors.
The exact thresholds for the rules are different for the two data sets, because the AmpliSeq is ~ 2000X in depth, and the Ion Torrent data is ~ 34 for tumor and ~ 47 for normal.
AmpliSeq on NextSeq 500 (2000X)
Three replicates of tumor and normal were sequenced, each at ~ 700X in depth.
Each pair of .fastq.gz files were aligned with BWA MEM as follows (no mark duplicate because of PCR enrichment)
docker run --rm -u $UID -v /:/mnt lethalfang/bwa:0.7.17_samtools bash -c \
"bwa mem \
-R '@RG\tID:AmpliSeqValidation\tLB:AmpliSeq\tPL:illumina\tSM:HCC1395' \
-M -t 12 \
/mnt/SEQC2_Resources/GRCh38.d1.vd1.fa \
/mnt/AmpliSeq_Validation/Breast-Carcinoma-SC-CRL-2324-1_180531_M04643_0036_000000000-BTW7M_L001_R1.fastq.gz \
/mnt/AmpliSeq_Validation/Breast-Carcinoma-SC-CRL-2324-1_180531_M04643_0036_000000000-BTW7M_L001_R2.fastq.gz \
| samtools view -Sbh - | samtools sort -m 4G --threads 12 -o /mnt/AmpliSeq_Validation/HCC1395_1.bam"
Figure 1
The figure on the left shows the AmpliSeq validation results.
Solid circles = positively validated. Open circles = negatively validated. X = uninterpretable.
AmpliSeq validation of the v1.0rc showed that lower-confidence calls have a higher probability of getting positively validated for low-VAF variants, along with other simple metrics, i.e., number of PASS and REJECT samples.
Based on that knowledge, and additional higher-depth data sets, confidence-levels were re-calibrated for low-VAF lower-confidence level calls.
All the material you need to reproduce AmpliSeq validation (except for the BAM files) are archived here. The annotation files are AmpliSeq.SNV.ValidationAnnotation.tsv and AmpliSeq.INDEL.ValidationAnnotation.tsv.
Ion Torrent WES on Ion S5 XL
Align and mark duplicates as follows for both tumor and normal
docker run --rm -u $UID -v /sc1:/sc1 -t lethalfang/tmap:3.0.1 \
/TMAP/tmap mapall -n 24 \
-f /sc1/SEQC2_Resources/GRCh38.d1.vd1.fa \
-r /sc1/IonTorrent/INT_EA_T_1_IonXpress_002.fastq.gz \
-v -Y -u \
-s /sc1/IonTorrent/INT_EA_T_1_IonXpress_002.bam \
-o 1 stage1 map4
samtools sort -m 4G --threads 4 -o /sc1/IonTorrent/INT_EA_T_1_IonXpress_002.sort.bam /sc1/IonTorrent/INT_EA_T_1_IonXpress_002.bam
docker run --rm -v /:/mnt -u $UID lethalfang/picard:2.18.4 \
java -Xmx16g -jar /opt/picard.jar MarkDuplicates \
I=/mnt/IonTorrent/INT_EA_N_1_IonXpress_002.sort.bam \
M=/mnt/IonTorrent/INT_EA_N_1_IonXpress_002.sort.dedup \
O=/mnt/IonTorrent/INT_EA_N_1_IonXpress_002.sort.dedup.bam \
CREATE_INDEX=true ASSUME_SORTED=true \
TMP_DIR=/mnt/IonTorrent/2018-07-31_15-04-44_508063456.temp
Agilent SureSelect v6 Exon UTR was used for capture. We downloaded the hg19 bed file, and lifted it over to GRCh38 at https://genome.ucsc.edu/cgi-bin/hgLiftOver. And then post-process the output BED file as follows:
cat hglft_genome_17c3_e7160.bed | egrep -v '_' > hglft_genome_17c3_e7160.majors.bed
bedtools sort -faidx GRCh38.d1.vd1.fa.fai -i hglft_genome_17c3_e7160.majors.bed > hglft_genome_17c3_e7160.majors.sort.bed
mergeBed -i hglft_genome_17c3_e7160.majors.sort.bed > hglft_genome_17c3_e7160.majors.sort.merged.GRCh38.bed
All the material you need to reproduce AmpliSeq validation (except for the BAM files) are archived here. The annotation files are IonTorrent.SNV.ValidationAnnotation.tsv and IonTorrent.INDEL.ValidationAnnotation.tsv.
BAM files are on CGC: INT_EA_T_1_IonXpress_001.sort.dedup.bam, INT_EA_N_1_IonXpress_002.sort.dedup.bam
Code to plot 95% binomial distribution confidence interval
from scipy.stats import beta
import math
def binom_interval(success, total, confint=0.95):
quantile = (1 - confint) / 2
lower = beta.ppf(quantile, success, total - success + 1)
upper = beta.ppf(1 - quantile, success + 1, total - success)
if math.isnan(lower): lower = 0
if math.isnan(upper): upper = 1
return (lower, upper)
# For 5% VAF at DP=100, the 95% interval is:
lowerVAF, upperVAF = binom_interval(5, 100)