howto Recalibrate base quality scores run BQSR

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2013-06-17

Objective

Recalibrate base quality scores in order to correct sequencing errors and other experimental artifacts.

Prerequisites

Steps

Analyze patterns of covariation in the sequence dataset
Do a second pass to analyze covariation remaining after recalibration
Generate before/after plots
Apply the recalibration to your sequence data

1. Analyze patterns of covariation in the sequence dataset

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ -T BaseRecalibrator \ -R reference.fa \ -I input_reads.bam \ -L 20 \ -knownSites dbsnp.vcf \ -knownSites gold_indels.vcf \ -o recal_data.table

Expected Result

This creates a GATKReport file called recal_data.table containing several tables. These tables contain the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data.

It is imperative that you provide the program with a set of known sites, otherwise it will refuse to run. The known sites are used to build the covariation model and estimate empirical base qualities. For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation.

Note that -L 20 is used here and in the next steps to restrict analysis to only chromosome 20 in the b37 human genome reference build. To run against a different reference, you may need to change the name of the contig according to the nomenclature used in your reference.

2. Do a second pass to analyze covariation remaining after recalibration

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ -T BaseRecalibrator \ -R reference.fa \ -I input_reads.bam \ -L 20 \ -knownSites dbsnp.vcf \ -knownSites gold_indels.vcf \ -BQSR recal_data.table \ -o post_recal_data.table

Expected Result

This creates another GATKReport file, which we will use in the next step to generate plots. Note the use of the -BQSR flag, which tells the GATK engine to perform on-the-fly recalibration based on the first recalibration data table.

3. Generate before/after plots

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ -T AnalyzeCovariates \ -R reference.fa \ -L 20 \ -before recal_data.table \ -after post_recal_data.table \ -plots recalibration_plots.pdf

Expected Result

This generates a document called recalibration_plots.pdf containing plots that show how the reported base qualities match up to the empirical qualities calculated by the BaseRecalibrator. Comparing the before and after plots allows you to check the effect of the base recalibration process before you actually apply the recalibration to your sequence data. For details on how to interpret the base recalibration plots, please see the online GATK documentation.

4. Apply the recalibration to your sequence data

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ -T PrintReads \ -R reference.fa \ -I input_reads.bam \ -L 20 \ -BQSR recal_data.table \ -o recal_reads.bam

Expected Result

This creates a file called recal_reads.bam containing all the original reads, but now with exquisitely accurate base substitution, insertion and deletion quality scores. By default, the original quality scores are discarded in order to keep the file size down. However, you have the option to retain them by adding the flag –emit_original_quals to the PrintReads command, in which case the original qualities will also be written in the file, tagged OQ.

Notice how this step uses a very simple tool, PrintReads, to apply the recalibration. What’s happening here is that we are loading in the original sequence data, having the GATK engine recalibrate the base qualities on-the-fly thanks to the -BQSR flag (as explained earlier), and just using PrintReads to write out the resulting data to the new file.

Updated on 2017-03-02

From Geraldine_VdAuwera on 2016-02-11

Ah, that was a mistake in the text. Fixed now. Thanks for pointing it out.

From armarkes on 2016-04-13

Can you please tell me if I have to run “BaseRecalibrator” and create a “recal_data.table” per BAM file, or if I can create only one table using all my bam files as input (each BAM file is from a different sample). Does it make sense?

This way can I use the same “recal_data.table” to do the PrintReads of all BAM files?

Thanks

From Sheila on 2016-04-14

@armarkes

Hi,

As long as your read groups are properly labeled, you can feed all the bams into BaseRecalibrator at once. [This article](https://www.broadinstitute.org/gatk/guide/article?id=3060) may help as well.

-Sheila

From armarkes on 2016-04-15

Thanks,

Ana

From armarkes on 2016-04-19

Regarding the same issue, I did that question because all my reads were obtained in the same lane. And I have several samples that were sequenced at the same time.

What I really wanted to know is: if I do the recalibration step using all the bam files (each is from different sample) as input, will I get a better recalibration table? Or it will be the same if I do a table per sample?

Each sample has a different read group, because I wanted to distinguish them if they are all merged in a bam file.

Thanks

Ana

From Sheila on 2016-04-20

@armarkes

Hi Ana,

BQSR should be done on all the reads in one lane. So, if your samples have been sequenced on the same lane, and the read group reflects that, you will be fine. Have a look at [this page](https://www.broadinstitute.org/gatk/guide/article?id=6472) for more help on read groups.

-Sheila

P.S. I hope [this article and following thread](http://gatkforums.broadinstitute.org/gatk/discussion/44/base-quality-score-recalibration-bqsr) will help too.

From armarkes on 2016-04-21

Thanks for your answer. Yet I have a doubt.

Since the only difference I have between my BAM files are the DNA samples (because library was the same, and they were sequenced all together in the same lane of the flowcell), should I run BaseRecalibrator in each bam file independently and create a recal.table for each sample and then run PrintReads in each bam file using its recal.table?

If I do this I will get the graphics (AnalyzeCovariates) per sample, using the recal.tables obtained per sample.

However I read somewhere in this forum that we should use the same recal.table to apply to all bam files (in PrintReads tool) from the same flowcell lane. And it makes sense because the sequencing error will be systematic through all samples.

And if I have only one recal.table generated from all my bam files, I will have only one graphic with pre and post recalibartion procedure.

I would like to know which option is recommended by you. Create only one recal.table using all my bam files as input, or create a recal.table per bam file?

Thanks

From Geraldine_VdAuwera on 2016-04-21

@armarkes Both ways are valid.

In the simplest case, where we have one library (LB) of one sample (SM) run on one lane, we run recalibration independently on the bam file that contains this read group’s data. In reality we usually run multiple other samples or libraries on the same lane (this is called multiplexing) so from a single lane we get multiple read groups. In our production pipeline this produces one bam file for each read group, and we run recalibration independently on each one (all with their own recal tables) even if they came from the same lane. If your pipeline produces a single bam file containing multiple read groups (that are appropriately distinguished) coming from the same lane, you can run base recalibration on that. You can even produce a single recalibration file by running BaseRecalibrator on multiple bams if you want, though it may take more time and memory.

The results will be the same in all cases because internally the processing is done per read group. This allows us to be able to recalibrate for errors that arise at the library level as well as at the lane level.

I think AnalyzeCovariates should produce separate graphics for each read group but I may be wrong, it has been a long time since I tried to run it that way.

From armarkes on 2016-04-21

Thanks for your answer.

Considering the information I read here https://www.broadinstitute.org/gatk/guide/article?id=6472, I created the read groups with BWA and I gave the same RG ID to all my sam files (because all my fastq data came from only one single-lane flow cell) but I gave different sample names (RG SM) to distinguish my samples in the @RG header. I had doubts on that so can you please tell me if I did it right?

example: @RG ID:M01600:44:1 SM:AA551 PL:ILLUMINA PI:250

Thanks again.

From Geraldine_VdAuwera on 2016-04-21

No that’s wrong — all the RG IDs need to be different. See the GATK dictionary for more details on read groups.

From Guillefriis on 2016-05-10

Hi,

As suggested in the best practice workflow, since I’m not working with humans, I have performed a SNPcalling with restrictive quality thresholds during the haplotype calling so I can used the detected variants as known sites in the BQSR:

for bams in $files

do filename=$(basename “$bams” _realg.bam)

java -jar $gatk \ -T HaplotypeCaller \ -I $bams \ -R $genome \ —min_base_quality_score 40 \ —min_mapping_quality_score 40 \ —emitRefConfidence GVCF \ —variant_index_type LINEAR \ —variant_index_parameter 128000 \ -nct 16 \ -o ${filename}.g.vcf

done

Note that since I’m using as a reference a published genome from a diferente species than the one I’m genotyping, I’m not interested in variants wit respect the reference genome but among my own samples. Then, I filter the vcf file to keep the multiallelic variants:

#—— JOINT GENOTYPES

java -jar $gatk \ -R $genome \ -T GenotypeGVCFs \ -V 04N4206.g.vcf \ -V 04N4219.g.vcf \ -V 11004.g.vcf \ -V 11005.g.vcf \ -V 12204.g.vcf \ -V 12M158.g.vcf \ -V 12M159.g.vcf \ -o JUNCO_variants_q40Q40.vcf \ -nt 16

#—— SELECT VARIANT TOOL

java -jar $gatk \ -T SelectVariants \ -V JUNCO_variants_q40Q40.vcf \ -R $genome \ -o JUNCO_variants_q40Q40_filtered.vcf \ -nt 16 \ -restrictAllelesTo MULTIALLELIC

I get then a really small dataset of ~3800 variants. The problem I’m having is that when I used the resulting dataset to recalibrated the bam files and perform the variant calling once again, I get an empty vcf file. Which can be the problem? Too few variants for the recalibration? Too low quality remaining SNPs?

Thanks a lot!

Guillermo

From Hugo_KPS on 2016-06-09

I am trying to recalibrate SNPs in a non-model organism. The information on the website tells me that I need a variant / known sites database, but in case that is not available, it recommends “For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation. “ I thought this is the GATK online documentation… so where else should I look?? can you put a link to the proper web page in the text?? Thanks in advance

From Sheila on 2016-06-10

@Hugo_KPS

Hi,

Have a look at [this article](http://gatkforums.broadinstitute.org/gatk/discussion/44/base-quality-score-recalibration-bqsr) under “I’m working on a genome that doesn’t really have a good SNP database yet. I’m wondering if it still makes sense to run base quality score recalibration without known SNPs.”

-Sheila

From medhat on 2016-06-17

IN the PrintReads step I had this error

ERROR MESSAGE: Bad input: The GATK report has no version specified in the header

I tried to use -rf BadCigar as suggested by other post like this

-T PrintReads -rf BadCigar -R genome.fa -Imarkduplicatrealigned.bam -BQSR markduplicatrealigned.table -o ./result/markduplicatrealignedrecal.bam

but I still have the same error any suggestion?

From Geraldine_VdAuwera on 2016-06-17

@Hugo_KPS That was copy-pasted from a paper manuscript, sorry for the confusion.

From Sheila on 2016-06-17

@medhat

Hi,

Are you using the latest version of GATK?

-Sheila

From Geraldine_VdAuwera on 2016-06-17

@medhat The program is complaining that there is something wrong with the recalibration table file. Have you opened it to see if the contents look ok?

From medhat on 2016-06-19

Sheila I am using GATK version 3.5 And the Issue as

Geraldine_VdAuwera said was in the .table file not in the .bam file as I understood, I reurn the step to get the .table file and now running the other step , I hope it will finish normally.

Thanks for help

From blaurent on 2016-08-10

This is a very naive question as I’m new in the field.

As discussed all over the documentation, BQSR is essential and cannot be dismissed.

So you should do it whatever variant calling analysis you want to perform.

But how do you get the known polymorphic site database?

I work on a bacteria (Bacillus subtilis) and I don’t even know if such database exists.

Thanks for your help.

From Sheila on 2016-08-10

@blaurent

Hi,

First you should check and find out if indeed there is such a database for your organism. If there is not, please refer to [this document](http://gatkforums.broadinstitute.org/gatk/discussion/44/base-quality-score-recalibration-bqsr) under “I’m working on a genome that doesn’t really have a good SNP database yet. I’m wondering if it still makes sense to run base quality score recalibration without known SNPs.”

-Sheila

From blaurent on 2016-08-10

@Sheila Thank you for your very quick answer. I’ll deep into that. Cheers

-Ben

From changpeng on 2016-11-22

Hi,

It’s my first time to use GATK.

What is the detailed mean of -L 20 ?

Thank you!

From Sheila on 2016-11-23

@changpeng

Hi,

Have a look at [the documentation](https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_engine_CommandLineGATK.php#—intervals) and [this article](https://software.broadinstitute.org/gatk/documentation/article?id=4133).

-Sheila

From RodenLuo on 2017-01-17

I think the answer to @changpeng ‘s question need to be briefly mentioned at [here](https://software.broadinstitute.org/gatk/documentation/article?id=2801). Because I go from GATK [Best practice](https://software.broadinstitute.org/gatk/best-practices/), to “Recalibrate Bases”, to [the tutorial](https://software.broadinstitute.org/gatk/documentation/article?id=2801). One should expect a newbie will read it. And since the hg19 bundle on GATK’s FTP uses UCSC style, which corresponds to `-L chr20` rather than `-L 20`. So I think this point should also be included in that tutorial. (Otherwise, users will get this `Badly formed genome location: Contig ‘20’ does not match any contig in the GATK sequence dictionary derived from the reference`. This will frustrate some users.) (Just a suggestion, it’s possible that I am just too naive and others may know this.)

The other thing I want to check is, seems that in the latest best practice, `IndelRealigner` has been removed (because I do not see it). If so, in this tutorial, it should use `dedup.bam` rather than currently used `realigned_reads.bam`.

It’s the first time I post here in `discussion`, I find very handy that this editor supports markdown. It would be great if there is a notice like “Markdown supported” below the editor.

Thanks!

From Geraldine_VdAuwera on 2017-01-18

@RodenLuo You’re right; I added a note and generalized the name of the input file. Thanks for reporting this.

We’ll see what we can do to highlight the markdown capabilities.

From mr_dave on 2017-02-28

Geraldine_VdAuwera said: >

RodenLuo You’re right; I added a note and generalized the name of the input file. Thanks for reporting this.

> We’ll see what we can do to highlight the markdown capabilities.

Step 2 mentions `-I realigned_reads.bam`. Is this also supposed to be the same BAM from Step 1 (`input_reads.bam`)? Or have I overlooked an input?

From Sheila on 2017-03-02

@mr_dave

Hi,

Yes, it is supposed to be input_reads.bam. Good catch :smile: I just fixed the document.

-Sheila

From helene on 2017-03-08

Hello,

I have two questions. 1. For the “-knowSites”, should the 1000G Indels be included as well. If not, is there a reason to it please? 2. I remember in some old version of GATK, we need to first create intervals before BQSR – is that skipped now? Thank you very much.

From Geraldine_VdAuwera on 2017-03-08

@helene

1. 1000G indels are already reported in the dbsnp file.

2. You may be thinking of RealignerTargetCreator, from the realignment step?

From helene on 2017-03-08

Geraldine_VdAuwera said: >

helene

> 1. 1000G indels are already reported in the dbsnp file.

> 2. You may be thinking of RealignerTargetCreator, from the realignment step?

Thanks so much for the quick reply.

From hh1985 on 2017-09-18

SBG’s GATK4 pipeline [https://igor.sbgenomics.com/public/apps?__hstc=64521200.d1e6ba8858b34c161a71becfc5ebfe13.1504981362166.1505716496548.1505754995022.8&__hssc=64521200.2.1505754995022&__hsfp=2341841478#admin/sbg-public-data/whole-genome-sequencing-bwa-gatk-4-0/](https://igor.sbgenomics.com/public/apps?__hstc=64521200.d1e6ba8858b34c161a71becfc5ebfe13.1504981362166.1505716496548.1505754995022.8&__hssc=64521200.2.1505754995022&__hsfp=2341841478#admin/sbg-public-data/whole-genome-sequencing-bwa-gatk-4-0/ “https://igor.sbgenomics.com/public/apps?__hstc=64521200.d1e6ba8858b34c161a71becfc5ebfe13.1504981362166.1505716496548.1505754995022.8&__hssc=64521200.2.1505754995022&__hsfp=2341841478#admin/sbg-public-data/whole-genome-sequencing-bwa-gatk-4-0/”), sets intervals to 20 (chr20) to speed up the BQSR modeling step.

> GATK BaseRecalibrator collects all the BAM files and uses only those covered with the BQSR intervals string input for creating the model for base quality score recalibration (BQSR). If the BQSR intervals string is not set, GATK BaseRecalibrator would work for more than 20 hours on a whole genome sample. For that reason, this input is set to required with the default value of 20 meaning only chromosome number 20 will be used for creating the model for BQSR.

Is this a reasonable way for general human sample processing?

From Frank_cumc on 2017-09-19

Hello, I am using the best practices pipeline and am trying to recalibrate the base qualities scores from a bam file generated after marking duplicates. I'm also using the hg19decoy.fasta from the resources section as reference and both dbsnp138 and the Millsand1000G_gold standard.indels for knownSites as indicated. My code is here

java -Xmx8G -jar $GATK -T BaseRecalibrator -R /home/frank/Documents/Resources/hg19/humang1kv37decoy.fasta -I 4779pipeddups.bam -knownSites /home/frank/Documents/Resources/hg19/dbsnp138.hg19.vcf -knownSites /home/frank/Documents/Resources/hg19/Millsand1000Ggoldstandard.indels.hg19.sites.vcf -o recal_data.table

java -Xmx8G -jar $GATK -T BaseRecalibrator -R /home/frank/Documents/Resources/hg19/humang1kv37decoy.fasta -I 4779pipeddups.bam -knownSites /home/frank/Documents/Resources/hg19/Millsand1000Ggoldstandard.indels.hg19.sites.vcf -o recaldata.table INFO 14:50:42,882 HelpFormatter - ---------------------------------------------------------------------------------- INFO 14:50:42,885 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50 INFO 14:50:42,885 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute INFO 14:50:42,885 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk INFO 14:50:42,885 HelpFormatter - [Tue Sep 19 14:50:42 EDT 2017] Executing on Linux 4.10.0-33-generic amd64 INFO 14:50:42,886 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0144-b01 INFO 14:50:42,890 HelpFormatter - Program Args: -T BaseRecalibrator -R /home/frank/Documents/Resources/hg19/humang1kv37decoy.fasta -I 4779pipeddups.bam -knownSites /home/frank/Documents/Resources/hg19/Millsand1000Ggoldstandard.indels.hg19.sites.vcf -o recaldata.table INFO 14:50:42,894 HelpFormatter - Executing as frank@ubuntu on Linux 4.10.0-33-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0144-b01. INFO 14:50:42,894 HelpFormatter - Date/Time: 2017/09/19 14:50:42 INFO 14:50:42,895 HelpFormatter - ---------------------------------------------------------------------------------- INFO 14:50:42,895 HelpFormatter - ---------------------------------------------------------------------------------- ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/home/frank/GATKsoftware/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console... INFO 14:50:43,039 GenomeAnalysisEngine - Deflater: JdkDeflater INFO 14:50:43,039 GenomeAnalysisEngine - Inflater: JdkInflater INFO 14:50:43,040 GenomeAnalysisEngine - Strictness is SILENT INFO 14:50:43,172 GenomeAnalysisEngine - Downsampling Settings: No downsampling INFO 14:50:43,180 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 14:50:43,229 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.05

ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836):

ERROR

ERROR This means that one or more arguments or inputs in your command are incorrect.

ERROR The error message below tells you what is the problem.

ERROR

ERROR If the problem is an invalid argument, please check the online documentation guide

ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.

ERROR

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions https://software.broadinstitute.org/gatk

ERROR

ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.

ERROR

ERROR MESSAGE: Input files knownSites and reference have incompatible contigs. Please see https://software.broadinstitute.org/gatk/documentation/article?id=63for more information. Error details: No overlapping contigs found.

ERROR knownSites contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1gl000191random, chr1gl000192random, chr4ctg9hap1, chr4gl000193random, chr4gl000194random, chr6apdhap1, chr6coxhap2, chr6dbbhap3, chr6mannhap4, chr6mcfhap5, chr6qblhap6, chr6sstohap7, chr7gl000195random, chr8gl000196random, chr8gl000197random, chr9gl000198random, chr9gl000199random, chr9gl000200random, chr9gl000201random, chr11gl000202random, chr17ctg5hap1, chr17gl000203random, chr17gl000204random, chr17gl000205random, chr17gl000206random, chr18gl000207random, chr19gl000208random, chr19gl000209random, chr21gl000210random, chrUngl000211, chrUngl000212, chrUngl000213, chrUngl000214, chrUngl000215, chrUngl000216, chrUngl000217, chrUngl000218, chrUngl000219, chrUngl000220, chrUngl000221, chrUngl000222, chrUngl000223, chrUngl000224, chrUngl000225, chrUngl000226, chrUngl000227, chrUngl000228, chrUngl000229, chrUngl000230, chrUngl000231, chrUngl000232, chrUngl000233, chrUngl000234, chrUngl000235, chrUngl000236, chrUngl000237, chrUngl000238, chrUngl000239, chrUngl000240, chrUngl000241, chrUngl000242, chrUngl000243, chrUngl000244, chrUngl000245, chrUngl000246, chrUngl000247, chrUngl000248, chrUn_gl000249]

ERROR reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1, NC_007605, hs37d5]

ERROR ------------------------------------------------------------------------------------------

And my output has two errors, one regarding the apache.logging.log4j class creation and the other about incompatible contigs. I aligned the fasta files to the hg19_decoy.fasta.

I tried looking through the forums and other sites for solution but still can't figure out how to resolve the issue. It sounds like the log4j problem is minor as you can still run the code and get the tool to work. Is the contigs problem because the knownsites files were not aligned to the g1k_v37 but the ucsc hg19 file?

Thank you for the help

From Frank_cumc on 2017-09-19

I realized I needed to use the b37 files instead of the hg19 and that solved the problem. The Log4j2 error is still there although the program seems to be running fine.

From Sheila on 2017-09-21

@hh1985

Hi,

BaseRecalibrator needs a lot of data to develop a good model, so running on a single chromosome is not a good idea. Have a look at [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/4272/targeted-sequencing-appropriate-to-use-baserecalibrator-bqsr-on-150m-bases-over-small-intervals).

-Sheila

From hh1985 on 2017-09-22

@Sheila Thanks a lot! I understand that model based on single chromosome may not be generalizable on the whole genome. In order to speed up the BaseRecalibrator step, how about using splitIntervals to create a list of sub-interval files, and run BaseRecalibrator/ApplyBQSR on each of the intervals?

From Sheila on 2017-09-25

@hh1985

Hi,

No, that won’t work either. BaseRecalibrator needs to see all the data at once. You can run BaseRecalibrator on all the chromosomes together then run PrintReads on each chromosome, however. Just make sure you output all sites in your BAM file instead of just the intervals you are interested in because you don’t want to lose any data.

-Sheila

From hh1985 on 2017-09-25

@Sheila Thanks for the suggestion! For your last sentence, I am still confusing. My understanding is that for whole genome sequencing, the split chromosomes (intervals) should cover all the regions of the bam file. For targeted sequencing, the target interval file may not be enough. In such case, I may need to update the interval file based on the bam file (I don’t know what tool to use yet), and then do the splitting. Is that correct?

-Han

From Sheila on 2017-09-27

@hh1985

Hi Han,

Oh, I see. The reason we suggest running on just the exome intervals is to speed things up and not waste compute. However, there may be data outside the intervals of interest that you will not want to miss out on in the final BAM file. We expect those regions (outside the intervals of interest) to be small and not contribute much to the effectiveness of the tools’ models. But, those outside regions may come in handy later on when you want to do QC or sometimes in HaplotypeCaller when they are used in the realignment process.

Have a look at [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/4798/vqsr-using-capture-and-padding) and [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/2219/l-option) as well.

-Sheila

From Bhanuprakash on 2018-02-06

i am new to GATK. i am running the BQSR step(using GATK 4). the first step i completed using BaseRecalibrator i.e analyze patterns of covariation in the sequence dataset. but while doing the second step i.e the second pass to analyze covariation remaining after recalibration, it shows the error message “BQSR is not a recognized option”.

This is my command

java -jar gatk-package-4.0.0.0-local.jar BaseRecalibrator -R Homo_sapiens_assembly38.fasta -I dedup_reads.bam —known-sites all_snp.vcf —known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf —BQSR recal_data.table -O post_recal_data.table

please tell me what i will do.

From Sheila on 2018-02-12

@Bhanuprakash

Hi,

I will answer your question [here](https://gatkforums.broadinstitute.org/gatk/discussion/44/base-quality-score-recalibration-bqsr).

-Sheila

From ekanterakis on 2018-05-29

Can you describe the (pdf) output of `AnalyzeCovariates` in a bit more detail? I’m a bit puzzled by the context covariate. In the docs you mention that the context is “current base + previous base (dinucleotide)”, then in the recal_table you store a “context of a fixed size (default 6)”, and the AnalyzeCovariates plot’s x-axis says “Context Covariate (3 base suffix)”. What context is considered for recalibation by default and what exactly is plotted? Also, is it the reference base or the read base context that is plotted? I assume the latter. Thanks!

From Sheila on 2018-06-05

@ekanterakis

Hi,

Sorry for the confusion. There are many difference contexts the tool takes into account. I think there are ~17 contexts, including dinucleotide and trinucleotide contexts. The team has determined those contexts to have errors while sequencing. Have a look at [this thread](https://gatkforums.broadinstitute.org/gatk/discussion/11976/how-does-the-bqsr-step-not-create-bias-in-snp-detection) for more information I hope will help.

-Sheila

From foxDie00 on 2018-08-21

I’m trying to run Step 2 where I do a second pass to analyze covariation, but I keep getting an error whenever it tries to read the `-BQSR ` flag. This is my code for the command:

`gatk BaseRecalibrator \`

`-R=”$REF_GENOME_FASTA” \`

`-I=”$IN_UNRECALIBRATED_BAM” \`

`—known-sites=”$KNOWNSITES_SNP_VCF” \`

`—known-sites=”$KNOWNSITES_INDEL_VCF” \`

`-BQSR=”$RECALIBRATION_TABLE_1”`

`-O=”$RECALIBRATION_TABLE_2_HEADER”.table \`

`-TMP_DIR=$TMPDIR`

And this is the error that it is giving me:

`A USER ERROR has occurred: B is not a recognized option`

I have tried switching it to `-bqsr` as well to no avail. How can I successfully add this flag to do on-the-fly recalibration and get the 2nd pass recalibration table to use for `AnalyzeCovariates`? I also do not see the `-BQSR` flag description when I pull up the documentation for `BaseRecalibratior`.

(For reference, I am running GATK4 on a lustre/SLURM system)

From foxDie00 on 2018-08-21

I found on another thread that the `-BQSR` flag no longer works for on-the-fly recalibration in GATK4. Can you please update the documentation to reflect that? Specifically this article as it still says to use `BaseRecalibrator + BQSR` instead of `ApplyBQSR`.

From Sheila on 2018-08-28

@foxDie00

Hi,

Yes, the team is making a plan to update all these tutorials to showcase GATK4 commands. In the meantime, have a look at the powerpoints in the Presentations section. Those have the most recent workflows and commands in them.

-Sheila

From Zea1nfO on 2018-10-29

how can i read a recal.table?is there any tutorial?

From pwaltman on 2018-11-29

Sheila said: >

foxDie00

> Hi,

> Yes, the team is making a plan to update all these tutorials to showcase GATK4 commands. In the meantime, have a look at the powerpoints in the Presentations section. Those have the most recent workflows and commands in them.

> -Sheila

Where in the Presentations section can we find the tutorials? I’m looking at the googleDrive folder that is linked to on the Presentations main page, but it is a bit overwhelming, and the organization isn’t very clear. (https://software.broadinstitute.org/gatk/documentation/presentations; link to gdrive: https://drive.google.com/open?id=1dS9wr_h6nh3BhPp1KKTGyXJS3upTw4j0)

From pwaltman on 2018-11-29

pwaltman said: > >

Sheila said:

> > @foxDie00

> > Hi,

> >

> > Yes, the team is making a plan to update all these tutorials to showcase GATK4 commands. In the meantime, have a look at the powerpoints in the Presentations section. Those have the most recent workflows and commands in them.

> >

> > -Sheila

> Where in the Presentations section can we find the tutorials? I’m looking at the googleDrive folder that is linked to on the Presentations main page, but it is a bit overwhelming, and the organization isn’t very clear. (https://software.broadinstitute.org/gatk/documentation/presentations; link to gdrive: https://drive.google.com/open?id=1dS9wr_h6nh3BhPp1KKTGyXJS3upTw4j0)

Found one of the presentations here:

https://drive.google.com/file/d/10WXyCb8hdmAIEktEW3ag7sekKGO8P92a/view?usp=sharing

Which is contained in this directory:

https://drive.google.com/drive/folders/1aJPswWdmMKLSmXB6jjHMltXj6XDjSHLJ

Report abuse