Base Quality Score Recalibration (BQSR)

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019.

created by rpoplin

on 2012-07-23

BQSR stands for Base Quality Score Recalibration. In a nutshell, it is a data pre-processing step that detects systematic errors made by the sequencer when it estimates the quality score of each base call. This document starts with a high-level overview of the purpose of this method; deeper technical details are provided further down.

Note that this base recalibration process (BQSR) should not be confused with variant recalibration (VQSR), which is a sophisticated filtering technique applied on the variant callset produced in a later step. The developers who named these methods wish to apologize sincerely to any Spanish-speaking users who might get awfully confused at this point.

Wait, what are base quality scores again?

These scores are per-base estimates of error emitted by the sequencing machines; they express how confident the machine was that it called the correct base each time. For example, let's say the machine reads an A nucleotide, and assigns a quality score of Q20 -- in Phred-scale, that means it's 99% sure it identified the base correctly. This may seem high, but it does mean that we can expect it to be wrong in one case out of 100; so if we have several billion basecalls (we get ~90 billion in a 30x genome), at that rate the machine would make the wrong call in 900 million bases. In practice each basecall gets its own quality score, determined through some dark magic jealously guarded by the manufacturer of the sequencer.
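
For reference, the standard Phred relationship between a quality score Q and the estimated error probability p is shown below; the Q20 example above follows directly from it.

```
Q = -10 * log10(p)        equivalently        p = 10^(-Q/10)

Q20  ->  p = 0.01    (1 error expected per 100 calls)
Q30  ->  p = 0.001   (1 error expected per 1,000 calls)
```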

Variant calling algorithms rely heavily on the quality score assigned to the individual base calls in each sequence read. This is because the quality score tells us how much we can trust that particular observation to inform us about the biological truth of the site where that base aligns. If we have a basecall that has a low quality score, that means we're not sure we actually read that A correctly, and it could actually be something else. So we won't trust it as much as other base calls that have higher qualities. In other words we use that score to weigh the evidence that we have for or against a variant allele existing at a particular site.

Okay, so what is base recalibration?

Unfortunately the scores produced by the machines are subject to various sources of systematic (non-random) technical error, leading to over- or under-estimated base quality scores in the data. Some of these errors are due to the physics or the chemistry of how the sequencing reaction works, and some are probably due to manufacturing flaws in the equipment.

Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. For example we can identify that, for a given run, whenever we called two A nucleotides in a row, the next base we called had a 1% higher rate of error. So any base call that comes after AA in a read should have its quality score reduced by 1%. We do that over several different covariates (mainly sequence context and position in read, or cycle) in a way that is additive. So the same base may have its quality score increased for one reason and decreased for another.

This allows us to get more accurate base qualities overall, which in turn improves the accuracy of our variant calls. To be clear, we can't correct the base calls themselves, i.e. we can't determine whether that low-quality A should actually have been a T -- but we can at least tell the variant caller more accurately how far it can trust that A. Note that in some cases we may find that some bases should have a higher quality score, which allows us to rescue observations that otherwise may have been given less consideration than they deserve. Anecdotally my impression is that sequencers are more often over-confident than under-confident, but we do occasionally see runs from sequencers that seemed to suffer from low self-esteem.

Fantastic! How does it work?

The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants, then it adjusts the base quality scores in the data based on the model. The known variants are used to mask out bases at sites of real (expected) variation, to avoid counting real variants as errors. Outside of the masked sites, every mismatch is counted as an error. The rest is mostly accounting.

There is an optional but highly recommended step that involves building a second model and generating before/after plots to visualize the effects of the recalibration process. This is useful for quality control purposes.

More detailed information

Detailed information about command line options for BaseRecalibrator can be found here.

The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.

This process is accomplished by analyzing the covariation among several features of a base. For example:

    • Reported quality score
    • The position within the read
    • The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine

These covariates are then applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.

For example, before recalibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so they are really Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machines, base mismatches with the reference occur more frequently at the end of the reads than at the beginning. Also, mismatches are strongly associated with sequencing context, in that the dinucleotide AC is often much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but also identify subsets of high-quality bases by separating the low-quality end-of-read AC bases from the high-quality TG bases at the start of the read. See below for examples of pre- and post-correction values.

The system was designed so that (sophisticated) users can easily add new covariates to the calculations. Users wishing to add their own covariate can simply look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to have a getValue method defined which looks at the read and associated sequence context and pulls out the desired information, such as machine cycle.

Running the tools

BaseRecalibrator

Detailed information about command line options for BaseRecalibrator can be found here.

This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:

    • read group the read belongs to
    • assigned quality score
    • machine cycle producing this base
    • current base + previous base (dinucleotide)

For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.
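
As a rough sketch (GATK3 syntax; the known-sites file name is a placeholder), the tabulation step described above might be run as:

```
java -jar GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -R reference.fasta \
    -I my_reads.bam \
    -knownSites dbsnp.vcf \
    -o my_reads.recal_data.grp
```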

Creating a recalibrated BAM

To create a recalibrated BAM you can use GATK's PrintReads with the engine's on-the-fly recalibration capability. Here is a typical command line to do so:

```
java -jar GenomeAnalysisTK.jar \
    -T PrintReads \
    -R reference.fasta \
    -I input.bam \
    -BQSR recalibration_report.grp \
    -o output.bam
```

After computing covariates in the initial BAM file, we walk through the BAM file a second time and rewrite the quality scores (in the QUAL field) into a new BAM file, using the data in the recalibration_report.grp file.

This step uses the recalibration table data in recalibration_report.grp produced by BaseRecalibrator to recalibrate the quality scores in input.bam, and writes out a new BAM file, output.bam, with recalibrated QUAL field values.

Effectively the new quality score is:

    • the sum of the global difference between reported quality scores and the empirical quality
    • plus the quality bin specific shift
    • plus the cycle x qual and dinucleotide x qual effects
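
Schematically, and glossing over the exact bookkeeping the engine does, the adjustment can be written as:

```
recalibrated_Q = reported_Q
               + (global empirical Q - global reported Q)          # read-group-wide shift
               + (empirical Q for this quality bin - reported Q)   # quality bin shift
               + delta_Q(cycle)                                    # cycle x qual effect
               + delta_Q(context)                                  # dinucleotide x qual effect
```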

Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In addition, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.

Miscellaneous information

    • The recalibration system is read-group aware. It separates the covariate data by read group in the recalibration_report.grp file (using @RG tags) and PrintReads will apply this data for each read group in the file. We routinely process BAM files with multiple read groups. Please note that the memory requirements scale linearly with the number of read groups in the file, so that files with many read groups could require a significant amount of RAM to store all of the covariate data.
    • A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yields significantly better results.
    • Unless your database of variation is so poor and/or variation so common in your organism that most of your mismatches are real SNPs, you should always perform recalibration on your BAM file. For humans, with dbSNP and now 1000 Genomes available, almost all of the mismatches - even in cancer - will be errors, and an accurate error model (essential for downstream analysis) can be ascertained.
    • The recalibrator applies a "Yates" correction for low-occupancy bins. Rather than inferring the true Q score from # mismatches / # bases, we actually infer it from (# mismatches + 1) / (# bases + 2). This deals very nicely with overfitting; the correction has only a minor impact on data sets with billions of bases but is critical to avoid overconfidence in rare bins in sparse data. A small worked example follows this list.
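
As a small worked example of that correction, consider a sparse bin holding 100 observed bases with 2 mismatches:

```
without correction:    2 / 100            = 0.0200  ->  empirical Q ~ 17.0
with correction:      (2 + 1) / (100 + 2) = 0.0294  ->  empirical Q ~ 15.3
```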

Example pre and post recalibration results

    • Recalibration of a lane sequenced at the Broad by an Illumina GA-II in February 2010
    • There is a significant improvement in the accuracy of the base quality scores after applying the GATK recalibration procedure

The output of the BaseRecalibrator

    • A Recalibration report containing all the recalibration information for the data

Note that the BaseRecalibrator no longer produces plots; this is now done by the AnalyzeCovariates tool.
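
As a sketch of that optional QC step (GATK3 syntax; file names are placeholders), a second recalibration table is built on top of the first and the two tables are then plotted against each other:

```
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R reference.fasta -I my_reads.bam \
    -knownSites dbsnp.vcf \
    -BQSR recal_data.grp \
    -o post_recal_data.grp

java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates \
    -R reference.fasta \
    -before recal_data.grp \
    -after post_recal_data.grp \
    -plots recalibration_plots.pdf
```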

The Recalibration Report

The recalibration report is a [GATKReport](http://gatk.vanillaforums.com/discussion/1244/what-is-a-gatkreport) and not only contains the main result of the analysis, but it is also used as an input to all subsequent analyses on the data. The recalibration report contains the following 5 tables:

    • Arguments Table -- a table with all the arguments and their values
    • Quantization Table
    • ReadGroup Table
    • Quality Score Table
    • Covariates Table

Arguments Table

This table contains all the arguments used to run BQSRv2 for this dataset. It is important so that the on-the-fly recalibration step uses the same parameters that were used in the recalibration step (context sizes, covariates, ...).

Example Arguments table:

```
#:GATKTable:true:1:17::;
#:GATKTable:Arguments:Recalibration argument collection values used in this run
Argument                 Value
covariate                null
default_platform         null
deletions_context_size   6
force_platform           null
insertions_context_size  6
...
```

Quantization Table

The GATK offers native support to quantize base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSRv2, a table with the base counts for each base quality is generated, along with a 'default' quantization table. This table is a required parameter for any other tool in the GATK if you want to quantize your quality scores.

The default behavior (currently) is to use no quantization when performing on-the-fly recalibration. You can override this by using the engine argument -qq. With -qq 0 you don't quantize qualities; with -qq N you recalculate the quantization bins on the fly, using N bins. Note that quantization is still experimental and we do not recommend using it unless you are a super advanced user.
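
For example (a hypothetical invocation; file names are placeholders), requesting 8 quantization bins during on-the-fly recalibration would look like:

```
java -jar GenomeAnalysisTK.jar -T PrintReads \
    -R reference.fasta \
    -I input.bam \
    -BQSR recalibration_report.grp \
    -qq 8 \
    -o output.bam
```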

Example Quantization table:

```
#:GATKTable:true:2:94:::;
#:GATKTable:Quantized:Quality quantization map
QualityScore  Count     QuantizedScore
0             252       0
1             15972     1
2             553525    2
3             2190142   9
4             5369681   9
9             83645762  9
...
```

ReadGroup Table

This table contains the empirical quality scores for each read group, for mismatches, insertions and deletions. This is no different from the table used in the old table recalibration walker.

```
#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:;
#:GATKTable:RecalTable0:
ReadGroup  EventType  EmpiricalQuality  EstimatedQReported  Observations  Errors
SRR032768  D          40.7476           45.0000             2642683174    222475
SRR032766  D          40.9072           45.0000             2630282426    213441
SRR032764  D          40.5931           45.0000             2919572148    254687
SRR032769  D          40.7448           45.0000             2850110574    240094
SRR032767  D          40.6820           45.0000             2820040026    241020
SRR032765  D          40.9034           45.0000             2441035052    198258
SRR032766  M          23.2573           23.7733             2630282426    12424434
SRR032768  M          23.0281           23.5366             2642683174    13159514
SRR032769  M          23.2608           23.6920             2850110574    13451898
SRR032764  M          23.2302           23.6039             2919572148    13877177
SRR032765  M          23.0271           23.5527             2441035052    12158144
SRR032767  M          23.1195           23.5852             2820040026    13750197
SRR032766  I          41.7198           45.0000             2630282426    177017
SRR032768  I          41.5682           45.0000             2642683174    184172
SRR032769  I          41.5828           45.0000             2850110574    197959
SRR032764  I          41.2958           45.0000             2919572148    216637
SRR032765  I          41.5546           45.0000             2441035052    170651
SRR032767  I          41.5192           45.0000             2820040026    198762
```

Quality Score Table

This table contains the empirical quality scores for each read group and original quality score, for mismatches, insertions and deletions. This is no different from the table used in the old table recalibration walker.

```
#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable1:
ReadGroup  QualityScore  EventType  EmpiricalQuality  Observations  Errors
SRR032767  49            M          33.7794           9549          3
SRR032769  49            M          36.9975           5008          0
SRR032764  49            M          39.2490           8411          0
SRR032766  18            M          17.7397           16330200      274803
SRR032768  18            M          17.7922           17707920      294405
SRR032764  45            I          41.2958           2919572148    216637
SRR032765  6             M          6.0600            3401801       842765
SRR032769  45            I          41.5828           2850110574    197959
SRR032764  6             M          6.0751            4220451       1041946
SRR032767  45            I          41.5192           2820040026    198762
SRR032769  6             M          6.3481            5045533       1169748
SRR032768  16            M          15.7681           12427549      329283
SRR032766  16            M          15.8173           11799056      309110
SRR032764  16            M          15.9033           13017244      334343
SRR032769  16            M          15.8042           13817386      363078
...
```

Covariates Table

This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry in this table, stratified by read group and original quality score.

```
#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:;
#:GATKTable:RecalTable2:
ReadGroup  QualityScore  CovariateValue  CovariateName  EventType  EmpiricalQuality  Observations  Errors
SRR032767  16            TACGGA          Context        M          14.2139           817           30
SRR032766  16            AACGGA          Context        M          14.9938           1420          44
SRR032765  16            TACGGA          Context        M          15.5145           711           19
SRR032768  16            AACGGA          Context        M          15.0133           1585          49
SRR032764  16            TACGGA          Context        M          14.5393           710           24
SRR032766  16            GACGGA          Context        M          17.9746           1379          21
SRR032768  45            CACCTC          Context        I          40.7907           575849        47
SRR032764  45            TACCTC          Context        I          43.8286           507088        20
SRR032769  45            TACGGC          Context        D          38.7536           37525         4
SRR032768  45            GACCTC          Context        I          46.0724           445275        10
SRR032766  45            CACCTC          Context        I          41.0696           575664        44
SRR032769  45            TACCTC          Context        I          43.4821           490491        21
SRR032766  45            CACGGC          Context        D          45.1471           65424         1
SRR032768  45            GACGGC          Context        D          45.3980           34657         0
SRR032767  45            TACGGC          Context        D          42.7663           37814         1
SRR032767  16            AACGGA          Context        M          15.9371           1647          41
SRR032764  16            GACGGA          Context        M          18.2642           1273          18
SRR032769  16            CACGGA          Context        M          13.0801           1442          70
SRR032765  16            GACGGA          Context        M          15.9934           1271          31
...
```

Troubleshooting

The memory requirements of the recalibrator will vary based on the type of JVM running the application and the number of read groups in the input bam file.

If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding ' -Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory on the processing computer.
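
For example, if you invoke GATK directly rather than through the wrapper script, the heap size goes on the java command line (16 GB is shown purely as an illustration):

```
java -Xmx16g -jar GenomeAnalysisTK.jar -T BaseRecalibrator ...
```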

I've tried recalibrating my data using a downloaded file, such as NA12878 on 454, but applying the table to any of the chromosome BAM files always fails due to hitting my memory limit. I've tried giving it as much as 15GB but that still isn't enough.

All of our big merged files for 454 are running with -Xmx16000m arguments to the JVM -- it's enough to process all of the files. 32GB might make the 454 runs a lot faster though.

I have a recalibration file calculated over the entire genome (such as for the 1000 genomes trio) but I split my file into pieces (such as by chromosome). Can the recalibration tables safely be applied to the per chromosome BAM files?

Yes they can. The original tables needed to be calculated over the whole genome but they can be applied to each piece of the data set independently.

I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.

The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.

However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:

    • First do an initial round of SNP calling on your original, unrecalibrated data.
    • Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.
    • Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.
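
In GATK3 command-line terms, one round of that bootstrap might look roughly like the sketch below (tool choice, file names, and the strategy for filtering down to high-confidence calls are all placeholders):

```
# 1. initial calling on the unrecalibrated BAM
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R reference.fasta -I my_reads.bam -o raw_calls.vcf

# 2. filter raw_calls.vcf down to a high-confidence subset -> confident_calls.vcf
#    (hard filters, QUAL thresholds, whatever you trust for your organism)

# 3. recalibrate using the provisional known-sites set
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R reference.fasta -I my_reads.bam \
    -knownSites confident_calls.vcf -o recal_data.grp
java -jar GenomeAnalysisTK.jar -T PrintReads \
    -R reference.fasta -I my_reads.bam \
    -BQSR recal_data.grp -o my_reads.recal.bam

# 4. call variants again on my_reads.recal.bam; repeat from step 2 until results converge
```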

Downsampling to reduce run time

For users concerned about run time please note this small analysis below showing the approximate number of reads per read group that are required to achieve a given level of recalibration performance. The analysis was performed with 51 base pair Illumina reads on pilot data from the 1000 Genomes Project. Downsampling can be achieved by specifying a genome interval using the -L option. For users concerned only with recalibration accuracy please disregard this plot and continue to use all available data when generating the recalibration table.
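
For instance (a hypothetical command; the interval is arbitrary), restricting BaseRecalibrator to a single chromosome with -L is a simple way to downsample:

```
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R reference.fasta -I my_reads.bam \
    -knownSites dbsnp.vcf \
    -L 20 \
    -o recal_data.grp
```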

Updated on 2016-06-22

From jhess on 2016-12-28

How does BQSR handle the “reserved” quality score 2 (ASCII value #), or what Illumina refers to as the “Read Segment Quality Control Indicator”? According to [this](https://drive.google.com/file/d/0B-lLYVUOliJFYjlkNjAwZjgtNDg4ZC00MTIyLTljNjgtMmUzN2M0NTUyNDE3/view?ddrp=1&hl=en) presentation (see second-to-last slide), bases with quality 2 belong to entire portions at the 3’ end of reads that should be trimmed before alignment (e.g. via BWA’s -q option), likely due to cluster dephasing that the machine’s calling algorithm couldn’t recover.

It seems from the base quality histograms above that BQSR is ignoring these quality 2 bases (I assume this because of the “Equal Heights” label):

![xx](https://us.v-cdn.net/5019796/uploads/FileUpload/6f/5309fc58b1e90cfedced982c9cda83.png)

but I would just like to confirm that this is indeed the case. Furthermore, is it ignoring these because they were clipped by the aligner (does BQSR ignore soft clipped bases?), or is it aware of the special meaning of quality score 2?

Thanks so much!

From oskarv on 2017-01-04

> @Geraldine_VdAuwera said: @oskarv Sorry your question got lost amid the other conversation on this thread. Yes, we expect to get marginal differences in output when parallelizing execution of BaseRecalibrator. It shouldn’t cause major differences, however. Are you seeing very different results?

Yes, we are using one pipeline based on a ruby script, and we are remaking it with scatter gather using WDL, and we can’t wrap our heads around why the ruby pipeline creates a 130GB file from baserecalibrator using only reference-fasta/input/-BQSR/output as settings, while the WDL based pipeline creates a 79GB file with the same settings and scatter gather parallelization. Up until that point the input files are almost exactly the same size, the input files that were created by MarkDuplicates only differ by 0.003247125 megabytes. But then, as far as I can tell, PrintReads with scatter gather parallelization creates an output file that is 51 GB smaller than the one created with -nct 8 for parallelization instead.

Here’s the BaseRecalibration code from the ruby script:

```

java -jar gatk-3.5.jar -T BaseRecalibrator -nct 8 -R human_g1k_v37_decoy.fasta -I markdup.bam -o recal.grp --phone_home NO_ET --gatk_key rr-research.no.key -knownSites dbsnp.vcf -knownSites 1000G_phase1.indels.b37.vcf -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf -cov ContextCovariate -cov CycleCovariate

```

N.B. we are aware that --phone_home is no longer in use; it has since been deleted from the script.

And the BaseRecalibration code from WDL:

```

java -jar gatk-3.5.jar -T BaseRecalibrator -R human_g1k_v37_decoy.fasta -I markdup.bam -o recal.csv -knownSites dbsnp.vcf -knownSites 1000G_phase1.indels.b37.vcf -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf -L 1:1+ -cov ContextCovariate -cov CycleCovariate

```

As a side note, in the WDL script, I’ve tried using -o recal.grp instead of .csv, but it crashes if I do, they’re identical so it doesn’t affect the analysis, but I find it peculiar.

And here’s the PrintReads code used by the ruby script:

```

java -jar gatk-3.5.jar -T PrintReads -nct 8 -R human_g1k_v37_decoy.fasta -I markdup.bam -BQSR recal.grp -o printreads.bam --phone_home NO_ET --gatk_key /Jar/research.no.key

```

The WDL code for PrintReads looks like this:

```

java -jar gatk-3.5.jar -T PrintReads -R human_g1k_v37_decoy.fasta -I markdup.bam -BQSR recal.csv -L 1:1+ -o printreads.bam

```

So as you can see, the code is identical apart from -nct and scatter gather, and the input files from MarkDuplicates are practically identical. How can this be explained?

From Geraldine_VdAuwera on 2017-01-04

@oskarv When you scatter-gather, do you handle unmapped reads explicitly? If not, that’s probably what’s being dropped from your scatter-gathered output. Have a look at the wdl we use in production, documented [here](https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_160927.md), for an example of how to do it.

From Geraldine_VdAuwera on 2017-01-04

@jhess That’s correct, Q2 bases are considered to be special and left untouched by BQSR.

From oskarv on 2017-01-05

> @Geraldine_VdAuwera said: @oskarv When you scatter-gather, do you handle unmapped reads explicitly? If not, that’s probably what’s being dropped from your scatter-gathered output. Have a look at the wdl we use in production, documented [here](https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_160927.md), for an example of how to do it.

This is a bit embarrassing, but it was supposed to say ApplyBQSR and not PrintReads. But I think your suggested solution is probably correct though. I based my pipeline on your public version, but since I couldn’t figure out the unmapped bam part when I first made the pipeline, I forgot it and didn’t consider it when the output files were smaller than expected. And then I switched out ApplyBQSR for PrintReads to fix it, and that’s when I started mixing things up.

But I have another question now, I suppose this is relevant whether I’ve implemented correct handling of unmapped reads or not, and this question applies for both PrintReads and ApplyBQSR. The total size of the shard folders from PrintReads is 182 GB, but the GatherBamFiles output file is 263 GB? While the total size of the shard folders from ApplyBQSR is 129 GB, and the GatherBamFiles folder is 79 GB. So for PrintReads, the GatherBamFiles folder is larger than the total size of the shards, but it’s the other way around for ApplyBQSR. Is this expected behavior?

From Geraldine_VdAuwera on 2017-01-07

@oskarv If you’re using ApplyBQSR, I assume you’re using GATK4? That doesn’t change my answer but until we switch to GATK4 as general supported version, be sure to mention it when you post, it’s good for us to know because in some cases it makes a difference. Anyway I hope the unmapped reads thing pans out.

Regarding the GatherBamFiles folder question, I don’t know off the top of my head because I’ve not done a side by side comparison myself. That does seem surprising — but if the final results are the same it could just indicate an implementation difference. Can I assume you’re running PrintReads out of version 3.5 and ApplyBQSR from GATK4? There’s the version coming into play, now :)

And have you run a stat on the final files to evaluate whether they’re substantially different or not?

From oskarv on 2017-01-10

> @Geraldine_VdAuwera said: @oskarv If you’re using ApplyBQSR, I assume you’re using GATK4? That doesn’t change my answer but until we switch to GATK4 as general supported version, be sure to mention it when you post, it’s good for us to know because in some cases it makes a difference. Anyway I hope the unmapped reads thing pans out.

Yes, I’m using GATK4, and I’ll be sure to be clear about that in the future. I added functionality for unmapped reads to the pipeline and the final VCF file is the same size as our reference pipelines’ final VCF file. But when I run my home made script to compare the contents of the VCF files, the files are significantly different.

The script takes two VCF files as input, strips out everything except for the SNP/INDEL positions and compares the files and gives a summary like this:

```

Output from a GATK-4/GATK-3.5/Picard-2.5 WDL pipeline compared with output from a GATK-3.5/Picard-2.5 ruby reference pipeline.

WDL.INDEL.vcf has 4748746 positions and 19515 unique positions.

RUBY.INDEL.vcf has 4743030 positions and 13799 unique positions.

They have 4729231 positions in common.

```

With regard to the number of positions, the numbers are all identical when the SNP files from the ruby- and wdl-based pipelines are compared.

The output files start to differ from the reference pipeline once ApplyBQSR from GATK-4 runs. The most recent run that ran with unmapped reads resulted in a 75GB file from ApplyBQSR/GatherBamFiles, while the output from PrintReads/GATK-3.5 is 130GB. The rest of the output files are pretty darn equal after ApplyBQSR though, but the contents differ more than we expect it to.

As far as I can tell, the tool settings are identical, so I’m pretty lost right now.

I put the WDL script in this pastebin, available for 1 month: http://pastebin.com/EfNZWzKx

This is the ruby preprocessing script: http://pastebin.com/Gv2tBDGn

And here’s the ruby germline script: http://pastebin.com/F6ugP33G

From Geraldine_VdAuwera on 2017-01-10

Hi @oskarv, can you please summarize what you’re comparing? I.e. what is the same vs what is different, step by step, and what exactly is different in the output? I’m a little confused. Also, have you looked at what is different right after the applyBQSR step? E.g. is the distribution of base qualities different, things like that?

From oskarv on 2017-01-11

> @Geraldine_VdAuwera said: Hi @oskarv, can you please summarize what you’re comparing? I.e. what is the same vs what is different, step by step, and what exactly is different in the output?

The final two steps in the pipeline are VariantRecalibrator and ApplyRecalibration for SNPs and INDELs, and the files I’m comparing are the ApplyRecalibration INDEL VCF file from the wdl pipeline and the ApplyRecalibration INDEL VCF file from the ruby pipeline.

Here’s the script: http://pastebin.com/Pfk4jb8N

It begins with taking the two VCF files as input and strips them free from everything except for the values under the column “POS”. It then sorts the values and stores them in a temporary file, one per VCF file. These files then get compared to count the number of values that they have in common, the values they have in common are saved to a file called “common_positions”, and then it compares the “common_positions” file with the complete files to find all of the unique positions in each VCF file.

The rest of the script only prints out the values to the terminal and into a new file called “compare-positions-summary”. It’s not elegant, but it gets the job done.

So the difference that is seen, i.e the values, is the SNP/INDEL positions in one VCF file that are or aren’t present in the other VCF file.

>Also, have you looked at what is different right after the applyBQSR step? E.g. is the distribution of base qualities different, things like that?

I have never done that before, so I googled and found samstat.

Here’s samstat results for the WDL BAM file from GatherBamFiles/ApplyBQSR: http://pastebin.com/dhnZBmdK (copy/paste the code to a new file and name it e.g samstat-wdl.html and open with your favorite browser, the results are nicely plotted in various graphs and such)

And here’s samstat results for the BAM file from PrintReads: http://pastebin.com/1Kx9snn8

There’s no combined comparison unfortunately.

But what about the calibration reports from BaseRecalibrator? In case they’re of interest, they’re in the links below:

WDL recalibration report from BaseRecalibrator, GATK 3.5, run with scatter gather parallelization: http://pastebin.com/nB282cqb

Ruby recalibration report from BaseRecalibrator, GATK 3.5, run with -nct parallelization: http://pastebin.com/aL5smpcA

Thank you for all of your help by the way, I really appreciate it.

From henryscut on 2017-01-13

Hello, I am having a problem using BQSR, I ended up having a bam file with base quality dramatically downgraded, I am not sure whether it is normal. Before recalibration, the base quality score in bam are a lot of “D-J“s, which correspond to base quality of 35-41, after recalibration, they all turned to around 27 (a lot of >), one example is: “BB1BDDEFHHHHFIIEGIJJIJJJJJJJIIIHIIIJJAGGIJIIJGEHIGIJIIIIIGIGIIHHECEFFEE>AC>?BBCCDDDDDDDCDC”, after recalibration, it becomes: “;:+<==<===<<<<;<<<;;;;;<;;;<::<<::<;;;<<;<;;<<<<;;<<;;<;;<=<<<<=<<>=><<<<<<=======<:<<”. I am working on a plant species, it has ~400Mb genome, and for the dbsnp, I got ~7M high quality sites. Otherwise, I use default values for “BaseRecalibrator” and “PrintReads”, I am using GATK3.5. Does anybody have a clue? Any suggestion is much appreciated!

From Geraldine_VdAuwera on 2017-01-13

You may be under-masking the real variation if your dbsnp set is not wide enough. How many variants do you typically expect in an individual of this species? And did you generate the recal plots?

From henryscut on 2017-01-14

Thank you! @Geraldine_VdAuwera !

I think 7M sites out of 400Mb is pretty decent already. The problem is that different samples may vary 10-20% in genome sequences; they may have new sequences, or large SVs, which are not included in the dbsnp sets. For the dbsnp sets I only get snps and small indels from samtools. Does this have an effect? Is there any cure for this?

For recal plots, you mean the base quality score distribution before/after recalibration? If so, see below:

score.png

From Geraldine_VdAuwera on 2017-01-15

@oskarv The graphical plots that are produced by BQSR will be more informative than the recal tables. In terms of comparing the output of the pipeline, comparing the numbers of calls themselves is not very informative. We expect some differences in marginal calls, so what’s important to find out here is whether the calls that change (either appearing or disappearing depending on the version you use) are all low-quality or not. If they are we would be comfortable saying that the variation is due to minor effects like rounding differences. If not, then it’s more worrying. Having some kind of measure of whether those are expected to be real calls or not (based eg on comparison to known sites) would be useful as well. We provide some documentation for ways to evaluate variant callsets, but we can’t guide you through it step by step as it’s largely out of scope of the support we can provide. We’re fairly short-handed at the moment and can’t spare the resources, I’m afraid.

From Geraldine_VdAuwera on 2017-01-15

@henryscut Our recommendations and testing are based on human variation, so they may not be as appropriate for your plant species. If there is a lot of variation that is not captured in your dbsnp, then it will look like a lot of errors to the BQSR algorithm. That could cause what you’re seeing. You could try bootstrapping the process as described in the documentation for species that do not have a dbsnp available. We’ve helped others with similar problems in the past, so be sure to search the forum for threads that describe similar issues.

From henryscut on 2017-01-16

@Geraldine_VdAuwera Okay, thanks a lot!

From angezou on 2017-01-22

Hello,

I was wondering what happens if we download a dbsnp file that most likely contains way more snps than there are mismatches in my bam file? How will positions that should contain a snp but don’t have a mismatch be treated? I am asking because I have quite low coverage data (4x – 5x) and from looking at the alignment on IGV it doesn’t look like the coverage is consistent. Would you recommend not using the dbsnp file and doing my own bootstrapping (calling high confidence variants and using those for bqsr) instead? Thanks!

From Sheila on 2017-01-24

@angezou

Hi,

If you are working with samples that have a known variation sites VCF, you can use that file in BQSR. It is okay to overmask the sites a little, rather than undermask the sites. Only the positions that are variant in the known VCF will be masked out in BQSR. The sites that may have novel variation in your own samples will not be masked out.

-Sheila

From oskarv on 2017-02-10

Is it possible to use PrintReads -bqsr on unmapped reads? It’s giving me this error message when I try:

```

ERROR MESSAGE: java.lang.Long cannot be cast to java.lang.Byte

```

It works fine with ApplyBQSR from GATK 4 with the same settings and input files.

Here’s the command I’m using:

```

java -jar GenomeAnalysisTK-3.5.jar -T PrintReads -I sample.bam -R human_g1k_v37_decoy.fasta -L unmapped -o out.bam -BQSR bqsr.csv

```

Am I missing something?

From Geraldine_VdAuwera on 2017-02-10

I think it should work. Might be a java version problem; 3.5 expects java 1.7, whereas later versions expect 1.8.

From ibseq on 2017-02-10

hi,

How can we run PrintReads on multiple bam files? What do we give to -I?

From Geraldine_VdAuwera on 2017-02-10

@ibseq Same as all tools that take -I, just do eg `-I first.bam -I second.bam -I third.bam`.

From oskarv on 2017-02-13

> @Geraldine_VdAuwera said:

> I think it should work. Might be a java version problem; 3.5 expects java 1.7, whereas later version expect 1.8.

It turned out to be an incompatibility between GatherBQSR from GATK4 and PrintReads from GATK3. Using

```

java -cp GATK3.jar org.broadinstitute.gatk.tools.GatherBqsrReports I=input1.grp I=input2.grp O=out.grp

```

solved it. GatherBQSR from GATK4 didn’t sort the output by Cycles and Context in RecalTable2, PrintReads didn’t like that.

From Geraldine_VdAuwera on 2017-02-13

Oh I see, we’ll have to document that. Thanks for pointing this out!

From ibseq on 2017-02-22

> @Geraldine_VdAuwera said: @oskarv When you scatter-gather, do you handle unmapped reads explicitly? If not, that’s probably what’s being dropped from your scatter-gathered output. Have a look at the wdl we use in production, documented [here](https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_160927.md), for an example of how to do it.

> @Geraldine_VdAuwera said: Oh I see, we’ll have to document that. Thanks for pointing this out!

Hi Geraldine,

thanks a lot for all these posts, although I am very confused about the steps to take when we don’t have a list of known snps/indels. The website gets pretty confusing, and my impression is that for some steps we just get lost.

I have my raw reads and up to the markduplicates steps is all ok.

then:

1. I run Haplotype caller in GVCF mode to generate per sample a gVCF file:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I my.bam --emitRefConfidence GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -o my.raw.snps.indels.g.vcf

2. I create a list of all the g.vcf by running

java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ref.fasta --variant gGVCF.list -o all.vcf

Now, I haven’t done the base recalibration as described here above my post: java -jar GenomeAnalysisTK.jar \ -T PrintReads \ -R reference.fasta \ -I input.bam \ -BQSR recalibration_report.grp \ -o output.bam

also because I don’t know how to generate the recalibration_report.grp for BQSR. I skipped this step and instead I use the output of Haplotypecaller from step 2 and run, always per sample:

3. java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I my.bam -knownSites my.raw.snps.indels.g.vcf -o recal_data.table

4. java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I my.bam -knownSites my.raw.snps.indels.g.vcf -BQSR recal_data.table -o post_recal_data.table

5. java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R ref.fasta -before recal_data.table -after post_recal_data.table -plots my.pdf

6. java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta -I my.bam -BQSR recal_data.table -o my_new.bam

I check the results and eventually I should repeat steps 3/4/5/6 starting from my new calibrated bam file? Or before generating the new bam file? if yes, what do I use as -knownsites?

Once I am happy with the reports in the pdfs, do I have to generate a new g.vcf per sample and a new all.vcf using GenotypeGVCFs? How do I proceed?

Thanks very much for the help.

ib

From ibseq on 2017-02-22

sorry numbers got renamed: 1-2-1-2-3-4 are 1-2-3-4-5-6

From Geraldine_VdAuwera on 2017-02-27

Hi @ibseq,

I realize the bootstrapping process is a bit confusing and we haven’t documented it systematically yet (too many other high-priority tasks) but we have answered what is basically this same question multiple times for other users on the forum, sometimes in a lot of detail. We can’t afford to take the time to go over this again right now, so can you please try to find one of the other user threads that addresses this on the forum? @Sheila may be able to point you to a specific thread. Thanks for understanding.

From Sheila on 2017-02-28

@ibseq

Hi ib,

I pointed you to [this thread](http://gatkforums.broadinstitute.org/firecloud/discussion/comment/28256/#Comment_28256) in another post. I think you already found that thread, but the specific comments I linked to should be helpful. You also may find [this thread](http://gatkforums.broadinstitute.org/gatk/discussion/4337/doesnt-base-quality-score-recalibration-degrade-sensitivity-in-heavily-mutated-cancers) useful. There are indeed many threads that discuss this. I did a simple search for “bootstrap” and got many helpful threads.

-Sheila

From mscjulia on 2017-05-25

I apologize to keep posting questions. In fact, you can ignore or delete my previous two comment posts, and I will summarize my questions here:

1. I’m wondering which parameter in the VCF file (raw vcf, before VQSR) is influenced by the BQSR step. Is it GQ or something else? If a site obtains a very low score from BQSR, would HaplotypeCaller exclude this site from variant calling?

2. In BQSR, two tables were generated by two steps, with file extension .grp and post.grp. To my understanding, the first step already converted reported quality score to an empirical score for every base, and I could find both scores in the .grp file. However, what exactly was done in the second step please? I see in the post.grp file, all tables are with same columns, but different values.

My guess is that the empirical score was used again here, but as reported score, to generate a new set of empirical score… I know it’s probably wrong so please correct me and provide your advice. Thanks a lot.

From Sheila on 2017-05-30

@mscjulia

Hi,

I just deleted the first two posts. Thanks for summarizing here.

1) BQSR adjusts the base quality scores if it is necessary. For example, some base qualities may go up or down after BQSR. The annotation values will not be affected much unless they directly use the base qualities (such as BaseQualityRankSum). As for the GQ and QUAL, they could be affected if there is a major difference in base qualities after BQSR. Have a look at [this article](https://software.broadinstitute.org/gatk/documentation/article?id=4442) for how base qualities are used in determining genotypes. Also, some of the links in that article explain how QUAL and GQ use base qualities.

If a site has all low base qualities, then yes, that site will not be used in variant calling. However, HaplotypeCaller takes into account all bases with quality of 10 or higher. You can change the default to a lower or higher number if you wish. Have a look at the tool doc for information on how to do that.

2) I think the BQSR presentation [here](https://drive.google.com/drive/folders/0BwTg3aXzGxEDNGtSazZoYWRNY3c) will help.

-Sheila

From anath on 2017-06-06

Hi

I am using the new version of GATK (GATK4) and performing BaseRecalibration.

I am unable to do a second pass to analyze covariation remaining after recalibration. The -BQSR option gives an error. Has this option changed in the new version?

Arti

From Sheila on 2017-06-07

@anath

Hi Arti,

Can you post the exact commands you ran in the workflow?

Thanks,

Sheila

From anath on 2017-06-08

Hi Sheila,

Thanks for getting back to me.

I am using the new version (GATK4-alpha) of gatk

step1 BQSR first-pass report file

gatk4 BaseRecalibrator -R genome.fa -I SRR4026630_dedup_reads.bam -knownSites 1000G_phase1.indels.b37.vcf -knownSites 1000G_phase1.snps.high_confidence.b37.vcf -O SRR4026630_recal_data.table

step2 BQSR second-pass report file

gatk4 BaseRecalibrator -R genome.fa -I SRR4026630_dedup_reads.bam -knownSites 1000G_phase1.indels.b37.vcf -knownSites 1000G_phase1.snps.high_confidence.b37.vcf -bqsr SRR4026630_recal.table -O SRR4026630_post_recal.table

A USER ERROR has occurred: b is not a recognized option

Instead I tried to run ApplyBQSR to get the second pass report

gatk4 ApplyBQSR -R genome.fa -I SRR4026630_dedup_reads.bam -bqsr SRR4026630_A_recal_data.table -O ./SRR4026630_A_post_recal_data.table

However, the output is a .bam file and I do not get a post table file to use as an INPUT with “AnalyzeCovariates”. ApplyBQSR seems to just recalibrate the bases and not produce a table for plotting. Is there a way to generate a post report which can be plotted later on?

Artika

From Sheila on 2017-06-12

@anath

Hi Artika,

I am checking with the team. I will get back to you asap.

-Sheila

From Sheila on 2017-06-14

@anath

Hi Artika,

[This page](https://github.com/broadinstitute/gatk/issues/322) explains the workflow.

-Sheila

From dilawerkh4 on 2017-07-14

To perform base recalibration, the documentation requires a VCF database of known polymorphic sites to mask out, such as dbSNP, to be used as an input file.

Which one of the following dbSNP files do you suggest using for WGS of worldwide populations? Link to the dbSNP page with the files is:

[https://ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/](https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/)

There is a file called “All” and another called “common_all” in table 1 on their page.

Thanks

From shlee on 2017-07-14

Hi @dilawerkh4,

Please use the resources in our resource bundle. You can find links to it at .

From dilawerkh4 on 2017-07-14

> @shlee said: Hi @dilawerkh4, Please use the resources in our resource bundle. You can find links to it at .

Thanks @shlee

So just to be sure this is the suggested file: 1000G_phase1.snps.high_confidence.b37.vcf.gz , right?

Also, the base recalibrator documentation does not indicate that the index file for the above is needed, namely : 1000G_phase1.snps.high_confidence.b37.vcf.idx.gz , right?

From shlee on 2017-07-14

@dilawerkh4,

If you are just starting out in your analyses, I think it will be helpful for you to spend some time going over our Best Practices documentation as well as forum. The resources we provide are not the latest but have been shown empirically to work well for past analyses for the tool parameters we recommend in our Best Practices. As a beginner, using the resources we provide should get you going in the right direction.

Here is a (slightly outdated) reference implementation that our own production uses: https://gatkforums.broadinstitute.org/gatk/discussion/7899/reference-implementation-pairedendsinglesamplewf-pipeline. And the most current implementation of this is in WDL script format at https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_170412.wdl. Inputs for the scripts are given in https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_170412.inputs.json. Specifically, BQSR uses the following known sites resources.

"##_COMMENT3": "KNOWN SITES RESOURCES", "PairedEndSingleSampleWorkflow.dbSNP_vcf": "gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf", "PairedEndSingleSampleWorkflow.dbSNP_vcf_index": "gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx", "PairedEndSingleSampleWorkflow.known_indels_sites_VCFs": [ "gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz", "gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz" ], "PairedEndSingleSampleWorkflow.known_indels_sites_indices": [ "gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi", "gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi" ],

From dilawerkh4 on 2017-07-15

@shlee said: @dilawerkh4,

If you are just starting out in your analyses, I think it will be helpful for you to spend some time going over our Best Practices documentation as well as forum. The resources we provide are not the latest but have been shown empirically to work well for past analyses for the tool parameters we recommend in our Best Practices. As a beginner, using the resources we provide should get you going in the right direction.

Here is a (slightly outdated) reference implementation that our own production uses: https://gatkforums.broadinstitute.org/gatk/discussion/7899/reference-implementation-pairedendsinglesamplewf-pipeline. And the most current implementation of this is in WDL script format at https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_170412.wdl. Inputs for the scripts are given in https://github.com/broadinstitute/wdl/blob/develop/scripts/broad_pipelines/PublicPairedSingleSampleWf_170412.inputs.json. Specifically, BQSR uses the following known sites resources.

"##_COMMENT3": "KNOWN SITES RESOURCES", "PairedEndSingleSampleWorkflow.dbSNP_vcf": "gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf", "PairedEndSingleSampleWorkflow.dbSNP_vcf_index": "gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx", "PairedEndSingleSampleWorkflow.known_indels_sites_VCFs": [ "gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz", "gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz" ], "PairedEndSingleSampleWorkflow.known_indels_sites_indices": [ "gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi", "gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi" ],

Got it, but Geraldine explains in the link below that we can simply recalibrate bases without having to get involved with the time consuming mess of creating intervals for scattering and then merging back all those scatter results into a single file, so I assume what Geraldine explains is still valid, correct?

https://gatkforums.broadinstitute.org/gatk/discussion/2801/howto-recalibrate-base-quality-scores-run-bqsr

From shlee on 2017-07-17

Yes certainly @dilawerkh4. Our production pipeline’s preprocessing may use scattering but by no means do you have to follow suit.

From Guillaume_D on 2017-07-20

Hi,

My question is probably quite stupid but I just want to be sure:

I already ran the GATK best practices pipeline on a set of individuals (non model species, so a “home made” set of known SNPs by bootstrapping for BQSR). If I want to add a new set of individuals to my study, can I run the BQSR (with the same set of known SNPs as previously) on this new set, then call the gVCF files of these new individuals and finally run GenotypeGVCFs on all the recalibrated gVCFs (new + old individuals, recalibrated separately)? Or should I run the BQSR on all the individuals together, or redo a set of known SNPs with the two sets of individuals pooled (or should I do something else)?

I guess the 1st option is the right one, because otherwise you lose the advantage of this new calling method with gVCF files, and because the recalibration step is ultimately done at the individual level, right?

Thanks,

Guillaume

From shlee on 2017-07-20

Hi @Guillaume_D,

Is your non-model species mammalian, where we can expect common variant sites, or another species for which you expect high divergence between samples? If the former, I believe it will be safe for you to use your prior known variants sites. If the latter, consider the extent of the differences and whether an updated known variants resource would impact your final variant calling results.

From Guillaume_D on 2017-07-21

Hi,

My species is a bird, and the populations compared are quite close. So, regarding your answer, I’m now quite confident I can use the same known variants sites.

Thanks a lot !

Guillaume

From jarda on 2017-07-31

Hi,

I frequently process a notable number of BAM files through the GATK recalibration workflow. I understand why new BAM files with recalibrated records are twice or more larger than the original ones (mostly because they keep the original base qualities). But is there any way to stop reporting original base qualities (I don’t need them), or to quickly remove them from the new BAM files? It could save me a lot of disc space.

From Geraldine_VdAuwera on 2017-08-01

Hi @jarda, Yes you can set the PrintReads tool to omit original base qualities— I forget the name of the argument but it should be in the tool docs somewhere. In fact I think in recent versions we get rid of them automatically and you have to request they be kept if you want them — what version are you using?

By the way, you can also disable indel qualities — they take up a lot of space and we’ve found that they don’t make much difference. We get rid of them too now in our internal pipelines.

From jarda on 2017-08-03

Hi @Geraldine_VdAuwera , thanks for your reply!

Currently, I am using gatk/3.5.0. I have to admit that those unwanted tags in records were indel qualities, not original base qualities (OQ – these were not generated in 3.5.0 yet). So, by using --disable_indel_quals, I achieved what I wanted. Thanks again!

From MattB on 2017-11-13

Hi, can anyone comment on how BQSR interplays with the NovaSeq’s quality scores? The NovaSeq emits only 4 values of base quality score; these are not binned like with the NextSeq, they are just 4 hardwired values of Q: 2 (base not called correctly), 12 (92% call conf), 25 (99.7%), 37 (99.98%).

Also will the current 3.8 version of GATK handle this data correctly?

I know [previously](https://gatkforums.broadinstitute.org/gatk/discussion/4594/beware-of-using-binned-quality-scores-with-some-gatk-procedures “previously”) there was this discussion of how binned quality scores could effectively be “un-binned” or at least improved somewhat in granularity by running BQSR on the input but this was the case of 7 distinct bins, not 4 fixed values you’d expect from a NovaSeq.

Additionally has anyone got experience of what the variant calling impact of input data with only 4 base quality scores will be? I imagine this is not the best sort of data for a low allelic frequency somatic variant calling? But might have less an impact for say normal het and homo variants in say 30x germline calling?

From shlee on 2017-11-15

Hi @MattB,

Our cloud germline production pipeline (https://software.broadinstitute.org/gatk/documentation/article?id=7899) in fact uses binning such that after BQSR there are four main bins at 10, 20, 30 and 40 plus 0 to 6 base qualities as BQs<6 remain untouched. So you can be sure that this does not impact germline variant calling much, although there will be some minor differences in results compared to unbinned BQSR.

So long as the reads that support a population low frequency allele are present in a sample at good quality, then this should be called by our workflows. I read in a post by Sheila that our new QUAL model ([HaplotypeCaller](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--useNewAFCalculator) `-newQual`; also in GenotypeGVCFs and introduced in 3.7) helps with calling rare low frequency alleles. Perhaps Sheila can confirm.

I’m not familiar with NovaSeq data. Given you say the quals are already binned for this data, I wonder what the impact of BQSR would be for such data. It seems binning again for BQSR here for the data would be a moot point (I could be wrong) and that as you say you would want to keep the finer resolution that unbinned BQSR would result in. I’ll ask the team on this point.

From MattB on 2017-11-15

Hi @shlee, thanks for the answer, we are already using `-newQual` which is working well in our higher depth applications. And thanks for the link to that production pipeline, the WDL link contained therein is also very useful (I have recently been looking for that).

My curiosity is that the NovaSeq data is not binned; it’s just that the machine itself only emits Q scores of 2, 12, 23 and 37 (note not 25 as I mistakenly mentioned above). Heng Li has this [rather interesting discussion](http://lh3.github.io/2017/07/24/on-nonvaseq-base-quality “rather interesting discussion”) on his blog on the matter, which reveals a tendency of the NovaSeq to slightly overestimate base quality. I’ve also unearthed this document on the new [RTA3 Software](https://support.illumina.com/content/dam/illumina-marketing/documents/products/appnotes/novaseq-hiseq-q30-app-note-770-2017-010.pdf “RTA3 Software”) responsible for generating these qual scores.

From shlee on 2017-11-15

Hi @MattB,

I have an answer from a developer:

> You should continue to run BQSR, it will work on Novaseq data, and you should have BQSR output re-binned qualities. The original qualities may not be accurate, so BQSR will take care of fixing any issues. Finally, binned qualities should not affect downstream variant calling….

So it seems you should apply BQSR with already binned NovaSeq data and have the corrected base qualities be binned yet again.

Thanks for the links. I will read them with interest.

From MattB on 2017-11-16

Hey, thanks for this. I'm just trying to figure out what arguments I now need to use in which tools. From my reading of this [article](https://software.broadinstitute.org/gatk/documentation/article.php?id=44 "article"), I gather I would use the `-qq` argument to output quantised base qualities, and since this is an engine-level parameter I would use it with PrintReads?

Additionally, would I want to specifically hardwire `-qq 4` here to ensure I'm still operating with 4 output bins, or should I go with the levels that have been automatically figured out in the default quantisation table? (Assuming my reading of this behaviour is correct.)

I also note that BaseRecalibrator has a `-pl` parameter which controls the number of quantisation levels used to build the quantisation table; this defaults to 16. Would I want to tweak that also, or should I just stay well away from that one? :)

From shlee on 2017-11-16

@MattB, I recently updated the BQSR slide deck that we use in workshops for the September 2017 (1709) Helsinki workshop. My updates include showing how to bin and how our production pipeline performs BQSR using GATK4 tools. To enable users, I added GATK4 commands next to the GATK3 commands for performing the steps of BQSR.

You can find links to workshop materials on our [Presentations page](https://software.broadinstitute.org/gatk/documentation/presentations).

Here are the particular slides I think are of interest to you.

---

Overview of changes between GATK3 and GATK4; notice the change in tools used for applying recalibration.

![](https://us.v-cdn.net/5019796/uploads/editor/1o/cnps5cpoqk35.png)

---

BaseRecalibrator syntax changes. See the footnote's link for exactly which versions of the tools our production pipeline uses for each step of BQSR.

![](https://us.v-cdn.net/5019796/uploads/editor/1i/1n1icvo5xo3l.png)

---

This slide shows additional options, including those for binning base qualities.

![](https://us.v-cdn.net/5019796/uploads/editor/1i/ov0v0yq9jzhu.png)

---

List of some finer details.

![](https://us.v-cdn.net/5019796/uploads/editor/no/vkx9gut30yp3.png)

From shlee on 2017-11-21

@MattB, there is an error in a slide above that was brought to my attention. BaseRecalibrator does not take the `-bqsr` parameter. To generate the 2nd recalibration table, you must use the recalibrated BAM with BaseRecalibrator. Apologies!

From MattB on 2017-11-21

Thanks, I was unaware of the `-sqq` arguments, so this is a help. And thanks for the correction regarding how to generate the remaining covariates for the BQSR before-and-after plot. You might also want to revise this [document](https://gatkforums.broadinstitute.org/gatk/discussion/2801/howto-recalibrate-base-quality-scores-run-bqsr "document").
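For anyone following along: my understanding is that in GATK3 these would be passed to PrintReads together with `-BQSR`, something like the untested sketch below (filenames are placeholders, and I'm assuming `-SQQ` is the short form of `--static_quantized_quals`):

    java -jar GenomeAnalysisTK.jar \
        -T PrintReads \
        -R reference.fasta \
        -I input.bam \
        -BQSR recal.table \
        -SQQ 10 -SQQ 20 -SQQ 30 \
        -o recalibrated.bam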

From Bhanuprakash on 2018-02-06

Hi @shlee

I am new to GATK analysis (using GATK4). I am confused by the BQSR 2nd recalibration step, because the 1st BaseRecalibrator step gives its output in table form. How do we create a recalibrated BAM file?

Thanking you

From Sheila on 2018-02-12

@Bhanuprakash

Hi,

You need to run [ApplyBQSR](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_bqsr_ApplyBQSR.php) to get the recalibrated BAM like Soo Hee mentioned above. Then, run BaseRecalibrator on the recalibrated BAM.

-Sheila

From sutturka on 2018-02-14

> @Sheila said:
>
> Hi,
>
> You need to run [ApplyBQSR](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_bqsr_ApplyBQSR.php) to get the recalibrated BAM like Soo Hee mentioned above. Then, run BaseRecalibrator on the recalibrated BAM.

I was also confused with the BQSR 2nd recalibration step. The above comment is helpful, but I am still unsure if my steps are correct. Essentially, I am running the `BaseRecalibrator` and `ApplyBQSR` steps twice and then using the BAM file from the second ApplyBQSR for downstream processing. Is this correct?

My commands are as below:

    # BQSR – first pass
    gatk BaseRecalibrator \
        -I sample_dedup.bam \
        -R /Reference/genome_ref.fasta \
        --known-sites mysnp.vcf \
        -O sample_recal_pass1.tabl

    gatk ApplyBQSR \
        -I sample_dedup.bam \
        -R /Reference/genome_ref.fasta \
        --bqsr-recal-file sample_recal_pass1.tabl \
        -O sample_dedup.recal.pass1.bam

    # BQSR – second pass
    gatk BaseRecalibrator \
        -I sample_dedup.recal.pass1.bam \
        -R /Reference/genome_ref.fasta \
        --known-sites mysnp.vcf \
        -O sample_recal_pass2.tabl

    gatk ApplyBQSR \
        -I sample_dedup.recal.pass1.bam \
        -R /Reference/genome_ref.fasta \
        --bqsr-recal-file sample_recal_pass2.tabl \
        -O sample_dedup.recal.pass2.bam

Then use the second pass BAM file `sample_dedup.recal.pass2.bam` for downstream processing.

    # compare outputs
    gatk AnalyzeCovariates \
        --before-report-file sample_recal_pass1.tabl \
        --after-report-file sample_recal_pass2.tabl \
        --plots-report-file output.pdf

From sutturka on 2018-02-19

Can you please confirm whether the BQSR workflow and commands above are correct?

From Sheila on 2018-02-19

@sutturka

Hi,

Yes, that is the correct workflow.

-Sheila

From kaboroevich on 2018-02-27

@Sheila

Are you really supposed to run ApplyBQSR twice as in @sutturka’s example? My understanding was that the second BaseRecalibrator was only to create a second table to provide to AnalyzeCovariates.

This would mean “sample_dedup.recal.pass**1**.bam” should be used downstream, not “sample_dedup.recal.pass**2**.bam”.

From sutturka on 2018-03-02

Following kaboroevich's question: Sheila, can you please confirm which file to use for downstream analysis, "sample_dedup.recal.pass1.bam" or "sample_dedup.recal.pass2.bam"? Appreciate your help.

From Sheila on 2018-03-05

@kaboroevich

Hi,

Yes, that is correct. Sorry @sutturka. You are indeed supposed to use sample_dedup.recal.pass1.bam in downstream analysis if all looks good with the AnalyzeCovariates plots.
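To spell it out, the flow would look roughly like this (filenames follow sutturka's example above; the pass-2 table is only used for the AnalyzeCovariates plots, not to make a second BAM):

    # pass 1: model the error covariates and apply the recalibration once
    gatk BaseRecalibrator \
        -I sample_dedup.bam \
        -R /Reference/genome_ref.fasta \
        --known-sites mysnp.vcf \
        -O sample_recal_pass1.tabl

    gatk ApplyBQSR \
        -I sample_dedup.bam \
        -R /Reference/genome_ref.fasta \
        --bqsr-recal-file sample_recal_pass1.tabl \
        -O sample_dedup.recal.pass1.bam

    # pass 2: build an "after" table from the recalibrated BAM, for plotting only
    gatk BaseRecalibrator \
        -I sample_dedup.recal.pass1.bam \
        -R /Reference/genome_ref.fasta \
        --known-sites mysnp.vcf \
        -O sample_recal_pass2.tabl

    # compare the before and after tables
    gatk AnalyzeCovariates \
        --before-report-file sample_recal_pass1.tabl \
        --after-report-file sample_recal_pass2.tabl \
        --plots-report-file output.pdf

    # sample_dedup.recal.pass1.bam is the file to carry forward into variant calling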

-Sheila

From jageis on 2018-04-12

How important is it to run BQSR on a per-read-group basis? I ask because we have many read groups (~5,000-10,000) with reads across ~12,500 bases from a single MiSeq run.

From mjtiv on 2018-04-12

Question about BQSR plots from GATK 4.0. In our data (3rd page of the PDF; plot at the bottom of the page), the mean quality score plot appears to show more randomness and less consistent, cohesive behavior. I was wondering if you could clarify what the "Context Covariate (3 base suffix)" plot should show (good data versus bad data)? All the prior plots are easier to interpret in terms of whether the data was cleaned up or not.

output_BQSR_plots.pdf

From Sheila on 2018-04-17

@jageis

Hi,

It is more important to run it on all reads that were sequenced in the same lane. I hope that may reduce the number of runs for you.

-Sheila

From Sheila on 2018-04-17

@mjtiv

Hi,

I don't think it is a case of "good vs. bad" data. The plot shows the mean base quality of the bases, before and after base recalibration, at the sites that follow those particular context covariates. Overall, the qualities go down following base recalibration for those sites. I hope this makes sense.

-Sheila

From nute11a on 2018-05-25

Hello,

I am trying to run the first step, BaseRecalibrator (the first pass listed above). My command seemed to run without error but did not create an output table. I am calling SNPs on a non-model organism and using an un-recalibrated VCF file of putative SNPs as input, per the GATK recommendation.

My command:

    gatk BaseRecalibrator \
        -I bam_file \
        -R reference_fasta \
        --known-sites <un-recalibrated VCF file of SNPs> \
        -O recal_data.table

There was no recal_data.table created, even though my command ran without error. My BAM file has been sorted, filtered for low-quality reads, and duplicate-marked, and I added read groups. Am I missing an extra step? What could be going on here?

From the output:

    INFO  BaseRecalibrator - Calculating quantized quality scores...
    INFO  BaseRecalibrator - Writing recalibration report...
    INFO  BaseRecalibrator - ...done!

Thank you!

From Sheila on 2018-05-31

@nute11a

Hi,

Very odd. Can you try running [ValidateSamFile](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_ValidateSamFile.php) on your input BAM and [ValidateVariants](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_variantutils_ValidateVariants.php) on your input VCF?
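For example, something like the following (placeholder filenames; summary mode should keep the ValidateSamFile output manageable):

    gatk ValidateSamFile \
        -I input.bam \
        --MODE SUMMARY

    gatk ValidateVariants \
        -R reference.fasta \
        -V known_sites.vcf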

Thanks,

Sheila

From mfallahi on 2018-06-12

Hi, I was wondering whether the number of bases in each bin is taken into account when the RMSE is calculated for the Reported Quality vs. Empirical Quality scatter plot?

Thanks,

Mo

From Sheila on 2018-06-19

@mfallahi

Hi Mo,

I guess the RMSE (root mean squared error) does take the number of bases into account, indirectly. The formula is:

RMSE = sqrt( sum( (QualityScore - EmpiricalScore)^2 * numberOfObservations ) / sum( numberOfObservations ) )
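As a toy illustration of that weighting (assuming a made-up tab-separated file with columns QualityScore, EmpiricalScore and numberOfObservations, which is not the native recalibration-table format):

    awk -F'\t' '{ d = $1 - $2; num += d * d * $3; den += $3 }
                END { print sqrt(num / den) }' qual_vs_empirical.tsv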

-Sheila

From jesson on 2018-06-27

Dear GATK team, I recently downloaded the GATK code, and I am confused about the BQSR algorithm. In BaseRecalibrator, I can understand that the empirical quality is a MAP estimate of the product of a Gaussian prior and a likelihood equivalent to a binomial distribution. But I don't understand the model in PrintReads: you use a hierarchicalBayesianQualityEstimate, which uses the three tables (recal table 0, recal table 1, recal table 2) as conditional priors and then evaluates the MAP to recalibrate the quality score. What is the meaning of "hierarchical"? Could you write out the method behind PrintReads in probability-decomposition form? I'm really curious about the model you use to recalibrate the quality score with such good performance. Thank you in advance.

From Sheila on 2018-06-28

@jesson

Hi,

> But I don't understand the model in PrintReads: you use a hierarchicalBayesianQualityEstimate, which uses the three tables (recal table 0, recal table 1, recal table 2) as conditional priors and then evaluates the MAP to recalibrate the quality score. What is the meaning of "hierarchical"?

Where did you read this?

> Could you write out the method behind PrintReads in probability-decomposition form?

Since PrintReads is now replaced by ApplyBQSR in GATK4, perhaps [the code](https://github.com/broadinstitute/gatk/blob/984e783609fadb254e63e78b41bb71c154c6f087/src/main/java/org/broadinstitute/hellbender/tools/walkers/bqsr/ApplyBQSR.java) will help.

-Sheila

From jesson on 2018-06-29

@Sheila

Hi, thanks for the kind reply.

I read the code from [here](https://github.com/broadinstitute/gatk/blob/14d71914fb97f163a975c13532430fe935e930a3/src/main/java/org/broadinstitute/hellbender/transformers/BQSRReadTransformer.java)

> Qrecal = Qreported + DeltaQ + DeltaQ(pos) + DeltaQ(dinuc) + DeltaQ( … any other covariate … )

And I want to know how to explain this equation using a probability structure, or via the hierarchicalBayesianQualityEstimate you defined. Any mathematical explanation of this would help me feel better.

Thanks,

ls

From Sheila on 2018-07-02

Hi Is,

I am checking with the team and will get back to you.

-Sheila

From jesson on 2018-07-04

@Sheila

Ok, I will wait until you get back. Thank you!

From falker on 2018-08-10

Hi,

When I started my analysis 1 year ago (24 WGS samples on pigs), I assigned one read group to all reads, because I didn't know better. Each sample consists of data from about 4 lanes. I completed an imputation project based on these 24 founder animals, involving more than 2,000 animals. The results look very reliable compared to chip genotyping and what is already published.

Anyway, do you GATK folks have any experience with how that mistake might influence the outcome of variant calling? As far as I am concerned, in the old days (e.g. using a samtools pipeline) BQSR was not performed at all, and running BQSR without read groups might still be better than not running it at all. Do my assumptions make sense?

Thank you for your support!

From Sheila on 2018-08-13

@falker

Hi,

Could you tell us a bit more about your end goal?

The major issue with no read groups is indeed in pre-processing. You may be correct that BQSR is better done than not done :smile: However, since lane artifacts can have a great impact, it is possible that some artifacts were masked by other lanes and you missed some issues that could have been corrected. Also, in variant calling, if you would like individual sample genotypes, it is best to define your read groups. In your case, it sounds like everything worked and your variant calls look good, so perhaps there is nothing to worry about.

-Sheila

From Sheila on 2018-08-13

@jesson

Hi again Is,

I never got back to you, but I am not sure exactly what you are asking for. The equation you posted is exactly how the final quality score is recalibrated: it adds up all the covariate-based adjustments that need to be applied to each base call. For example, if the reported quality is Q30 and the model estimates a global shift of -2, a cycle shift of +1 and a context shift of -1, the recalibrated quality is Q28. There really is no fancier calculation than that. I hope I am making sense.

-Sheila

From falker on 2018-08-16

Hi Sheila,

My end goal is to use 60k chip data from an F2 generation of ~2,500 animals and to impute that data using 24 deeply sequenced F0 and 90 low-coverage sequenced F1 animals.

So the workflow is:

Variant calling on F0 -> variant calling on F1 -> take high-confidence F1 SNPs (no INDELs) and impute with F0 data -> impute F2 chip data with F0 + F1_imputed data.

This workflow increased my variation data from 60k to ~24 million.

I ran a GWAS with the final dataset, and peaks are highly similar compared to the 60k data, but of course resolution is much greater. Concerning BQSR, the final dataset relies heavily on the F0 w/o read groups…

From Sheila on 2018-08-23

@falker

Hi,

Sorry for the delay. I need to think/check with the team and get back to you.

-Sheila

From fengtao on 2018-09-16

Hi,

I work on Arabidopsis and am now running BQSR. I have questions about the -knownSites option:

I have a population-level snp.vcf file, which is quite large, and I also have individual-level snp.vcf files. Should I use the population-level one or just choose one of the individuals? As the population-level VCF file is quite large, I am worried it will slow down the process.

The indels and SNPs are included in a single VCF file; should I split them into separate files for BQSR?

From jesson on 2018-09-26

@Sheila

Hi,

Thanks for your reply, but I still think the whole set of changes can be explained with a simple and beautiful probability structure. I'm sorry I did not notice your reply in this post these past days, because I was waiting for your reply in this [post](https://gatkforums.broadinstitute.org/gatk/discussion/comment/52122#Comment_52122), and I felt very sad when I read the news that you will be leaving GATK and will never come back. T_T

But still, thank you very much for doing all these things for us strangers. Good luck to you!

From manolis on 2018-09-27

Hi, a question about the following pipeline from sutturka:

Why do I have to use "sample_dedup.recal.pass2.bam" for the downstream analysis? I would have thought I should use the first one, "sample_dedup.recal.pass1.bam"!?

What is the difference between the 1st and the 2nd (double-recalibrated) BAM?

> @sutturka said:
>
> > @Sheila said:
> >
> > Hi,
> >
> > You need to run [ApplyBQSR](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_bqsr_ApplyBQSR.php) to get the recalibrated BAM like Soo Hee mentioned above. Then, run BaseRecalibrator on the recalibrated BAM.
>
> I was also confused with the BQSR 2nd recalibration step. The above comment is helpful, but I am still unsure if my steps are correct. Essentially, I am running the `BaseRecalibrator` and `ApplyBQSR` steps twice and then using the BAM file from the second ApplyBQSR for downstream processing. Is this correct?
>
> My commands are as below:
>
>     # BQSR – first pass
>     gatk BaseRecalibrator \
>         -I sample_dedup.bam \
>         -R /Reference/genome_ref.fasta \
>         --known-sites mysnp.vcf \
>         -O sample_recal_pass1.tabl
>
>     gatk ApplyBQSR \
>         -I sample_dedup.bam \
>         -R /Reference/genome_ref.fasta \
>         --bqsr-recal-file sample_recal_pass1.tabl \
>         -O sample_dedup.recal.pass1.bam
>
>     # BQSR – second pass
>     gatk BaseRecalibrator \
>         -I sample_dedup.recal.pass1.bam \
>         -R /Reference/genome_ref.fasta \
>         --known-sites mysnp.vcf \
>         -O sample_recal_pass2.tabl
>
>     gatk ApplyBQSR \
>         -I sample_dedup.recal.pass1.bam \
>         -R /Reference/genome_ref.fasta \
>         --bqsr-recal-file sample_recal_pass2.tabl \
>         -O sample_dedup.recal.pass2.bam
>
> Then use the second pass BAM file `sample_dedup.recal.pass2.bam` for downstream processing.
>
>     # compare outputs
>     gatk AnalyzeCovariates \
>         --before-report-file sample_recal_pass1.tabl \
>         --after-report-file sample_recal_pass2.tabl \
>         --plots-report-file output.pdf

From sergey_ko13 on 2018-09-27

Hi @manolis,

That was corrected here:

https://gatkforums.broadinstitute.org/gatk/discussion/comment/46608/#Comment_46608

From manolis on 2018-09-29

Perfect, many thanks! I missed it

From OliviaChen on 2019-12-02

(Sorry to interrupt. I figured out what the problem was.)