Identify high-confidence Regions

GATK CallableLoci

We used GATK CallableLoci to identify callable regions in each of the 63 tumor-normal BAM files (i.e., 126 BAM files). 

GATK CallableLoci parameters:

The "consensus callable regions" are MajorityAlignersCallable.bed and mergeBed.MajorityAlignersCallable.bed (which is just mergeBed -i MajorityAlignersCallable.bed). 

An example GATK3 command used in those tasks are

docker run --rm -u 83748 broadinstitute/gatk3:3.8-0 \

java -Xmx8g -jar GenomeAnalysisTK.jar \

-T                              CallableLoci \

-R                              /PATH/TO/GRCh38.d1.vd1.fa \

-I                              /PATH/TO/WGS_IL_N_3.bwa.dedup.bam \

--maxDepth                      376 \

--maxFractionOfReadsWithLowMAPQ 0.1 \

--maxLowMAPQ                    1 \

--minBaseQuality                20 \

--minMappingQuality             20 \

--minDepth                      10 \

--minDepthForLowMAPQ            10 \

--summary                       /PATH/TO/WGS_IL_N_3.bwa.summary \

-o                              /PATH/TO/WGS_IL_N_3.bwa.Callable.bed

Count regions that are reproducibly callable

We used a SomaticSeq utility to count and identify common loci among the 63 tumor's WGS_SITE_T_REPLICATE.ALIGNER.Callable.bed with the following scheme. This utility will keep track and count regions that have appeared in a BED file (i.e., GATK CallableLoci's output), i.e., an example command:

lociCounterWithLabels.py \

-fai    genome.fa.fai \

-beds   replicate1.bed replicate2.bed replicate3.bed \

-labels Rep1 Rep2 Rep3

Imagine a contigX in genome.fa has a length of 1000, and the 3 bed files have the following content:

# replicate1.bed:

contigX    100    300

contigX    400    600

 

# replicate2.bed

contigX    80     320

contigX    400    550

 

# replicate3.bed

contigX    90     310

contigX    420    700

It will produce the following output, where the 4th column keeps track of how many times this region has been counted from the bed files, and the 5th column denotes the corresponding BED file (or its label) that overlaps this region:

X    0      80      0     .

X    80     90      1     Rep2

X    90     100     2     Rep2,Rep3

X    100    300     3     Rep1,Rep2,Rep3

X    300    310     2     Rep2,Rep3

X    310    320     1     Rep2

X    320    400     0     .

X    400    420     2     Rep1,Rep2

X    420    550     3     Rep1,Rep2,Rep3

X    550    600     2     Rep1,Rep3

X    600    700     1     Rep3

X    700    1000    0     .

Briefly as follows

1) For each sequencing centers and each aligner, we keep the callable loci where the majority of the data sets returned the region. An example file name output for this step is IL.bwa.MajorityCallable.bed. The regions there represents regions where at least 2 out of 3 replicates from IL for BWA has been identified as CALLABLE. We also grouped LLU, NCI, and EATRIS data sets into a single "site" as Other. For each aligner, then there are 5 BED files, i.e., IL, NS, FD, NV, and Other.

2) For each aligner, we used the same utility to get the loci where the majority of the loci returned from step 1, i.e., BWA.MajoritySitesCallable.bed was generated from IL.bwa.MajorityCallable.bed, NS.bwa.MajorityCallable.bed, FD.bwa.MajorityCallable.bed, NV.bwa.MajorityCallable.bed, and Other.bwa.MajorityCallable.bed.

3) Consolidate BWA.MajoritySitesCallable.bed, Novo.MajoritySitesCallable.bed, and Bowtie.MajoritySitesCallable.bed into MajorityAlignersCallable.bed.

4) Use Bedtools Merge to merge adjacent regions: mergeBed.MajorityAlignersCallable.bed

Commands and Scripts

Those three scripts were run successively to create the consensus callable bed files

Result Files

The following are the commands to generate those output BED files following the GATK CallableLoci outputs. Those files were zipped and uploaded to CGC: Consensus_GATK_CallableLoci.zip

The last file containing information for every data set is MajorityAlignersCallable.bed.gz and mergeBed.MajorityAlignersCallable.bed.

The region in MajorityAlignersCallable.bed consists 2.73 billion bp, or about 88% of the entire human genome. This is the file used to annotate FLAGS=NonCallable in 1st release of Gold Set for variants not in it.

As can be seen from the table below, of the somatic mutation candidates outside those area, over 95% of them are Tier 4 calls or worse (i.e., likely false positives), and only 1.5% calls are classified as Tier 1. Whereas in the callable regions, the fractions are 56.5% and 42.3%, respectively.

Once we have validation data sets, additional confidence-level information will be annotated in additional to the initial tiers, which is crude way to assign confidence-level based on cross-site and cross-aligner reproducibility of calls.

The stats below have mutations in 6p, 16q, chrX, and chrY removed

callableLociStats

High-Confidence Regions

To create the "High-Confidence Regions" as a part of the benchmark call set, 20-bp regions spanning LowConf and Unclassified positions were additionally subtracted from the mergeBed.MajorityAlignersCallable.bed described above. 

In the latest iteration, it is named High-Confidence_Regions_v1.2.bed here

GiaB high-confidence regions: