Germline short variant discovery SNPs Indels

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by Geraldine_VdAuwera

on 2018-01-07

Important: This document is currently being updated

Purpose

Identify germline short variants (SNPs and Indels) in one or more individuals to produce a joint callset in VCF format.

Reference Implementations

| Pipeline | Summary | Notes | Github | Terra | |:-------------|:---------------|:---------|:-----------|:--------------| | Prod* germline short variant per-sample calling | uBAM to GVCF | optimized for GCP | yes | pending | | Prod* germline short variant joint genotyping | GVCFs to cohort VCF | optimized for GCP | yes | pending | | $5 Genome Analysis Pipeline| uBAM to GVCF or cohort VCF | optimized for GCP (see blog) | yes | hg38 | | Generic germline short variant per-sample calling | analysis-ready BAM to GVCF | universal | yes | hg38 | | Generic germline short variant joint genotyping | GVCFs to cohort VCF | universal | yes | hg38 & b37 | | Intel germline short variant per-sample calling | uBAM to GVCF | Intel optimized for local architectures | yes | NA |

* Prod refers to the Broad Institute's Data Sciences Platform production pipelines, which are used to process sequence data produced by the Broad's Genomic Sequencing Platform facility.

Expected input

This workflow is designed to operate on a set of samples constituting a study cohort. Specifically, a set of per-sample BAM files that have been pre-processed as described in the GATK Best Practices for data pre-processing.

Main steps for Germline Cohort Data

We begin by calling variants per sample in order to produce a file in GVCF format. Next, we consolidate GVCFs from multiple samples into a GenomicsDB datastore. We then perform joint genotyping, and finally, apply VQSR filtering to produce the final multisample callset with the desired balance of precision and sensitivity.

Additional steps such as Genotype Refinement and Variant Annotation may be included depending on experimental design; those are not documented here.

Call variants per-sample

Tools involved: HaplotypeCaller (in GVCF mode)

In the past, variant callers specialized in either SNPs or Indels, or (like the GATK's own UnifiedGenotyper) could call both but had to do so them using separate models of variation. The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper.

In the GVCF mode used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per-sample to generate an intermediate file called a GVCF , which can then be used for joint genotyping of multiple samples in a very efficient way. This enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes

In practice, this step can be appended to the pre-processing section to form a single pipeline applied per-sample, going from the original unmapped BAM containing raw sequence all the way to the GVCF for each sample. This is the implementation used in production at the Broad Institute.

Consolidate GVCFs

Tools involved: GenomicsDBImport

This step consists of consolidating the contents of GVCF files across multiple samples in order to improve scalability and speed the next step, joint genotyping. Note that this is NOT equivalent to the joint genotyping step; variants in the resulting merged GVCF cannot be considered to have been called jointly.

Prior to GATK4 this was done through hierarchical merges with a tool called CombineGVCFs. This tool is included in GATK4 for legacy purposes, but performance is far superior when using GenomicsDBImport, which produces a datastore instead of a GVCF file.

Joint-Call Cohort

Tools involved: GenotypeGVCFs

At this step, we gather all the per-sample GVCFs (or combined GVCFs if we are working with large numbers of samples) and pass them all together to the joint genotyping tool, GenotypeGVCFs. This produces a set of joint-called SNP and indel calls ready for filtering. This cohort-wide analysis empowers sensitive detection of variants even at difficult sites, and produces a squared-off matrix of genotypes that provides information about all sites of interest in all samples considered, which is important for many downstream analyses.

This step runs quite quickly and can be rerun at any point when samples are added to the cohort, thereby solving the so-called N+1 problem.

Filter Variants by Variant (Quality Score) Recalibration

Tools involved: VariantRecalibrator, ApplyRecalibration

The GATK's variant calling tools are designed to be very lenient in order to achieve a high degree of sensitivity. This is good because it minimizes the chance of missing real variants, but it does mean that we need to filter the raw callset they produce in order to reduce the amount of false positives, which can be quite large.

The established way to filter the raw variant callset is to use variant quality score recalibration (VQSR), which uses machine learning to identify annotation profiles of variants that are likely to be real, and assigns a VQSLOD score to each variant that is much more reliable than the QUAL score calculated by the caller. In the first step of this two-step process, the program builds a model based on training variants, then applies that model to the data to assign a well-calibrated probability to each variant call. We can then use this variant quality score in the second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.

The downside of how variant recalibration works is that the algorithm requires high-quality sets of known variants to use as training and truth resources, which for many organisms are not yet available. It also requires quite a lot of data in order to learn the profiles of good vs. bad variants, so it can be difficult or even impossible to use on small datasets that involve only one or a few samples, on targeted sequencing data, on RNAseq, and on non-model organisms. If for any of these reasons you find that you cannot perform variant recalibration on your data (after having tried the workarounds that we recommend, where applicable), you will need to use hard-filtering instead. This consists of setting flat thresholds for specific annotations and applying them to all variants equally. See the methods articles and FAQs for more details on how to do this.

We are currently experimenting with neural network-based approaches with the goal of eventually replacing VQSR with a more powerful and flexible filtering process.

Main steps for Germline Single-Sample Data

Single sample variant discovery uses HaplotypeCaller in its default single-sample mode to call variants in an analysis-ready BAM file. The VCF that HaplotypeCaller emits errs on the side of sensitivity, so some filtering is often desired. To filter variants first run the CNNScoreVariants tool. This tool annotates each variant with a score indicating the model's prediction of the quality of each variant. To apply filters based on those scores run the FIlterVariantTranches tool with SNP and INDEL sensitivity tranches appropriate for your task.

Notes on methodology

The central tenet that governs the variant discovery part of the workflow is that the accuracy and sensitivity of the germline variant discovery algorithm are significantly increased when it is provided with data from many samples at the same time. Specifically, the variant calling program needs to be able to construct a squared-off matrix of genotypes representing all potentially variant genomic positions, across all samples in the cohort. Note that this is distinct from the primitive approach of combining variant calls generated separately per-sample, which lack information about the confidence of homozygous-reference or other uncalled genotypes.

In earlier versions of the variant discovery phase, multiple per-sample BAM files were presented directly to the variant calling program for joint analysis. However, that scaled very poorly with the number of samples, posing unacceptable limits on the size of the study cohorts that could be analyzed in that way. In addition, it was not possible to add samples incrementally to a study; all variant calling work had to be redone when new samples were introduced.

Starting with GATK version 3.x, a new approach was introduced, which decoupled the two internal processes that previously composed variant calling: (1) the initial per-sample collection of variant context statistics and calculation of all possible genotype likelihoods given each sample by itself, which require access to the original BAM file reads and is computationally expensive, and (2) the calculation of genotype posterior probabilities per-sample given the genotype likelihoods across all samples in the cohort, which is computationally cheap. These were made into the separate steps described below, enabling incremental growth of cohorts as well as scaling to large cohort sizes.

Attachments:

BP_single_sample_variant-01.png

Updated on 2019-07-14

From moxu on 2018-03-23

Is there a step by step instruction for Germline short variant discovery (SNPs + Indels) like the one for RNA-seq variant calling? I don’t know WDL at all.

Thanks!

From shlee on 2018-03-23

Hi @moxu,

Let’s kill two birds with one stone then.

See https://gatkforums.broadinstitute.org/gatk/discussion/7899/reference-implementation-pairedendsinglesamplewf-pipeline for an implementation of germline short variant discovery that also explains WDL script components. The commands you would run manually are pretty much the same as that outlined in the second half of the document, under TASK DEFINITIONS. Note that there is an updated workflow of the same name in [gatk-workflows](https://github.com/gatk-workflows).

For considerations in germline variant discovery, see the following tutorials:

- https://software.broadinstitute.org/gatk/documentation/article?id=6483

- https://software.broadinstitute.org/gatk/documentation/article?id=8017

And it looks like one of our workshop tutorials is now online:

- https://software.broadinstitute.org/gatk/documentation/article?id=7869

From rourich on 2018-05-07

Hi,

I am using BaseRecalibrator for the Germline short variant discovery (SNPs + Indels) workflow. However, I don’t know what know sites should I use.

Since I have mapped my reads against GRCh38 from [here](http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/ “here”), I understand that I should download the know sites files from [this bundle](ftp://ftp.broadinstitute.org/bundle/hg38/ “this bundle”). However, [this article](https://software.broadinstitute.org/gatk/documentation/article.php?id=1247 “this article”) recommends the use of “1KG indels” which in fact is not stored within the hg38 bundle (dbSNP and “Mills indels” are stored in the hg38 bundle).

I realize that “1KG indels” file is stored in [b37 bundle](ftp://ftp.broadinstitute.org/bundle/b37/ “b37 bundle”). However, b37 is another reference…

I would like you to confirm me if using only dbSNP and “Mills indels” as know sites for BaseRecalibrator with hg38 is correct and that I don’t have to use “1KG indels” so that my results don’t have false positives.

Thank you very much, best regards

From shlee on 2018-05-07

Hi @rourich,

The most up-to-date information can be had at our [gatk-workflows](https://github.com/gatk-workflows) repository. In particular, for GRCh38, you will be interested in the [broad-prod-wgs-germline-snps-indels](https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels) workflow input files which are given [here](https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels/blob/master/PairedEndSingleSampleWf.hg38.inputs.json). Within this json, you see the indel files:

``` “PairedEndSingleSampleWorkflow.known_indels_sites_VCFs”: [ “gs://broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz”, “gs://broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz“ ], “PairedEndSingleSampleWorkflow.known_indels_sites_indices”: [ “gs://broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi”, “gs://broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi“ ],

```

These are accessible to you from the Google bucket linked in [our Resource Bundle page](https://software.broadinstitute.org/gatk/download/bundle) and I think should also be accessible at `gs://broad-references/hg38/v0`. You can access public gcloud storage files from the command-line using `gsutil`, which is available when you install [Google Cloud SDK](https://cloud.google.com/sdk/downloads).

From ipeddie on 2018-05-18

Hi there,

I’m trying to follow the Best Practices workflow using GATK4 locally.

I have done my pre-procession (eg running BaseRecalibrator and ApplyBQSR) using GATK4 and looking to create the gVCF for each of my samples.

Having read thru the WDL, and also https://github.com/gatk-workflows/gatk4-germline-snps-indels ‘s README, it seems that for HaplotypeCaller you DO NOT recommend using GATK4 but going back to GATK3 version for this step. Is this correct?

Hope you can help.

Thanks.

Eddie

From Sheila on 2018-05-21

@ipeddie

Hi Eddie,

Sorry for the confusion. You can use GATK4 HaplotypeCaller, as it has been validated and is good to use. I will let the team know to fix the wording in that doc.

-Sheila

From splaisan on 2018-05-22

Hi,

A bit on the same line (I think!). I am still using the command-line approach and have difficulties finding the recommended reference files for variant recalibration.

From older v3 pages, I figured out a command below but do not know how to set the ‘prior’ values and do not know wether I chose the right —resource vcf.gz data from the bucket.

Is there a place where I can find rationale explaining these choices (especially the priors)?

Also, I noticed a number of commands where the number of ‘dash’ signs before a named argument have changed between one an two (eg —tranches-file below). This is quite confusing, especially when posted tutorials (v3) are not running anymore under v4). Any plan of creating v4 versions of the CLI analysis tutorial pages (I am sure others will prefer good old CLI over Jason and other management systems.)?

Thanks for your help

Stephane

```

Variantrecalibrator
java -jar $GATK/gatk.jar VariantRecalibrator \ -R $BWA_INDEXES/NCBI_GRCh38.fa \ -V ${samplename}.vcf.gz \ —resource hapmap,known=false,training=true,truth=true,prior=15.0:../reference/hg38_v0_hapmap_3.3.hg38.vcf.gz \ —resource Mills_and_1000G_gold,known=false,training=true,truth=false,prior=12.0:../reference/hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \ —resource omni,known=false,training=true,truth=false,prior=12.0:../reference/hg38_v0_1000G_omni2.5.hg38.vcf.gz \ —resource dbsnp,known=true,training=false,truth=false,prior=2.0:../reference/hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf.gz \ -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \ -mode BOTH \ —output output.recal \ —tranches-file output.tranches \ —rscript-file output.plots.R
```

From splaisan on 2018-05-23

I found the answer on that page [https://gatkforums.broadinstitute.org/gatk/discussion/1259/which-training-sets-arguments-should-i-use-for-running-vqsr](https://gatkforums.broadinstitute.org/gatk/discussion/1259/which-training-sets-arguments-should-i-use-for-running-vqsr “https://gatkforums.broadinstitute.org/gatk/discussion/1259/which-training-sets-arguments-should-i-use-for-running-vqsr”)

Q: What is still open is wether one should split the variants in SNPs and INDELS and calibrate them separately, apply and merge the recalibrated results OR can we call them both with BOTH and the sum of all resources as above.

From ipeddie on 2018-05-24

Sheila said: >

ipeddie

> Hi Eddie,

> Sorry for the confusion. You can use GATK4 HaplotypeCaller, as it has been validated and is good to use. I will let the team know to fix the wording in that doc.

> -Sheila

Thanks for the reply Shelia

From Sheila on 2018-05-30

@splaisan

Hi Stephane,

I see. Thanks for bringing this to our attention. I think the tool docs should help. Also, the presentation from GATK3 VQSR should help as well [here](https://drive.google.com/drive/folders/0BwTg3aXzGxEDNGtSazZoYWRNY3c).

For your question, I think slides 11 onwards will help in the presentation I linked to. You should run VariantRecalibrator and ApplyRecalibration twice (once in SNP mode and once in INDEL mode).

I hope that helps.

-Sheila

From Drdeepak on 2018-06-13

Hi this is Dr. Deepak

I am using the GATK for non model organism (an insect) for which i have three samples say s1, s2, s3

I make a database of snps using hard filter a vcf file (produced from the gvcf to vcf file). Now the question is for base recalibration should i use each sample separately that will produce three bsqr table in each iteration or i combine all the three in a single command (by multiple I option) so that it will produce a single bsqr table in each iteration.

From Sheila on 2018-06-19

@Drdeepak

Hi,

I responded [here](https://gatkforums.broadinstitute.org/gatk/discussion/comment/49308#Comment_49308).

-Sheila

From sprocha on 2018-07-06

Hi!

not sure if best place to ask this is here but here it goes:

I am doing germline VC with GATK4.0.0 (single sample HC and then joingVC). I am trying to find which read filters HC is using by default and not managing to find it easily… going through the 41 possible read filters mentioned in the documentation, only a couple mention default values and I find no explicit info about this…

although I suspect it does is trying to filter out things, as I find the following in my logs:

``` 22:54:01.508 INFO HaplotypeCaller – No reads filtered by: ((((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter) AND WellformedReadFilter) ```

(appears in the end…)

I used the following cmds when running HC:

``` time ( gatk HaplotypeCaller \ -I mySample.bam \ -O mySample.g.vcf.gz \ -R $reference \ -ERC GVCF \ -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation ) ```

Because I want to play with the impact of different filters (reads and variant), is really imptte for me to get this right.

Seems to be better documented in previous versions than here (although I am missing something)..

any help really appreciated!

many thanks in advance…

ps: could u point me to the answer of this as well in v 3.7 ?

From shlee on 2018-07-06

Hi @sprocha,

Yes, there is a very easy way to see the list of filters used by a tool. For GATK4 call up the tool help, e.g. `gatk HaplotypeCaller` and look for the parameter `—disable-read-filter`. It will list all of the filters that are disable-able, i.e. that are being used by default.

```

—disable-read-filter,-DF:String Read filters to be disabled before analysis This argument may be specified 0 or more times. Default value: null. Possible Values: {GoodCigarReadFilter, MappedReadFilter, MappingQualityAvailableReadFilter, MappingQualityReadFilter, NonZeroReferenceLengthAlignmentReadFilter, NotDuplicateReadFilter, NotSecondaryAlignmentReadFilter, PassesVendorQualityCheckReadFilter, WellformedReadFilter}

```

For GATK3, please refer to GATK3 versioned documentation.

From sprocha on 2018-07-10

Great, many thanks!!

From sprocha on 2018-07-11

Hi,

new question:

according to GATK HC best practices I asked this in my tests (still running):

[-G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation]

I’ll be needing to do hard filtering, and actually will want to play with some filters, but right now struggling with:

how do I know which annotations will be in my files?… of the 47 possible found in the link bellow:

https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.0.0/#VariantAnnotations

…all ?

thanks!

From sprocha on 2018-07-16

Hi again,

question is about hard filtering non-model organism|capture data| many species (= no known variants). I see there are 2 tools for that: “FilterVCF” and “VariantFiltration”.

As I understood it, In FilterFVC we can apply: min_FS, min_AB, min_DG, min_GQ and min_QD. “VariantFiltration” seems a more general tool where we can filter (all?) variant annotations that we want… you have the following “advices” about the most informative (hard) filters and their values:

```

For SNP’s For INDELS:

QD > 2 QD < 2

FS > 60 FS > 200

MQ > 40 (RMSMappQual) ReadPosRankSum < 20

MQRankSum < -12.5 SOR > 10

ReadPosRankSum < -8

SOR > 3

```

I understand that these are indications and we must find a way of doing the best with our data… but after reading a couple papers that highlight the importance of GQ and AB… (for example this: DeSumma et al 2017 (https://doi.org/10.1186/s12859-017-1537-8) )… I am wondering:

1) why these 6 are still given as the 6 most informative? what is the basis for this, and is it current practice to filter based on them, and on these thresholds?

2) why does FilterVCF exists, why does it implements other filters, and is there any difference/advantage in using it versus VarianFiltration?

and I guess it all resumes to:

3) what strategy would you currently follow to filter your variants in data for which known variants (and thus VQSR is impossible), especially target-capture data (non-uniform depth of coverage).

any advice, good reference, really welcome,

many thanks,

sara

From Sheila on 2018-07-16

@sprocha

Hi Sara,

1) The team did some testing back in the day and found those annotations were the best for keeping sensitivity and specificity high. We still recommend those annotations for filtering as they work well.

2) FilterVCF is a Picard tool. Picard and GATK were separate toolkits before GATK4. I think the developers of each tool must have had different purposes in mind. Now that GATK4 simply “sucks in” Picard, the tools coexist. It looks like VariantFiltration does everything FilterVCF does and more, so I would stick to VariantFiltration.

3) I hope [this article](https://software.broadinstitute.org/gatk/documentation/article?id=11069) will help.

-Sheila

From sprocha on 2018-07-23

ok, thanks a lot again!

From water111 on 2018-07-24

hi, I am trying to use the GATK4 to call germine SNP&Indel, while for the BQSR setp, gatk —list cant’t find the tool “RealignerTargetCreator” and “IndelRealigner” that can be used in GATK3.7 , is this two tools are just didn’t use in GATK4 or are replaced by other tools (searched and didn’t find them), thanks for your instructions!

From gregputzel on 2018-07-26

Hi,

In the “Consolidate GVCFs” section, this document refers a few times to the tool “ImportGenomicsDB”, when actually the command seems to be called “GenomicsDBImport”.

Thanks

Greg

From shlee on 2018-07-26

Hi @water111,

As you point out, yes, indel realignment tools are not in GATK4. Indel realignment used to be a part of the Best Practices. For recent higher quality data and for workflows that use reassembly-based callers, it’s been deemed redundant. You can read more about considerations in [Blog#7847](https://gatkforums.broadinstitute.org/gatk/discussion/7847/changing-workflows-around-calling-snps-and-indels).

If you would like to continue using indel realignment tools, they are available in GATK3.

From Sheila on 2018-07-26

@water111

Hi,

Have a look at [this blog post](https://software.broadinstitute.org/gatk/blog?id=7847).

-Sheila

From shlee on 2018-07-26

Hi @gregputzel,

Thanks for pointing this out. I’ve corrected ImportGenomicsDB —> GenomicsDBImport in the document. Cheers.

From OliviaChen on 2018-08-01

Hi !

I am trying to use the GATK4 to call SNP&Indel in my patient samples. I used both single calling strategy and joint calling strategy. Naturally, many same variants can be found through both two strategies, while different variants can also be found in the two result respectively.

I conducted the next analysis and Sanger sequencing for the candidate variants found in that different part, and I confirmed some pathogenic variants, which were called only by single calling strategy or joint calling strategy.

This phenomenon confused me very much! If I only conduct single calling strategy or joint calling strategy, I will missed some candidate pathogenic variants ?!

Excuse me, do you know the cause of this phenomenon? Have you ever looked at different variants called out in the two strategies and made statistical study ? Is there a better recommendation for variants calling strategy?

My problem may be a little complicated, I hope you can understand it.

Looking forward to your reply. Thank you!

From Sheila on 2018-08-06

@OliviaChen

Hi,

Can you post some example sites that are missed by each?

Thanks,

Sheila

From OliviaChen on 2018-08-07

@Sheila

Hi! Thanks for your reply.

We analyzed sequencing data of a panel of several targeted genes. After running the pipeline of Best Practices, we called 167 variants out by both single calling and joints calling.

One variant of Chr4: 1803571C>G was only called out by joint calling.

28 variants were only called out by single calling, including but not limited to – chr1: 218610689G>T, chr4: 1803714G>A, chr5: 174152020G>A, chr9: 101891311A>T, chr15: 67358545G>A, chrX: 68060300C>A…

Here are some example sites different in two calling strategies. (In order to keep the information confidential, I’ve given only a few examples. We can contact privately if necessary.)

Additionally, in these days waiting for the reply, someone reminded me that we used MultipSeq® (Multiple PCR targeted capture sequencing), so the procedure of “mark duplicates” in Best Practice could be skipped. I accepted this advice and amended my pipeline.

After amendment, 205 variants were called out by both single calling and joints calling.

5 variants were only called out by single calling, while no variant were only called out by joint calling.

In this new pipeline, it seems that joint calling refined the results called by single calling, **but here is a new question: How should I use the optional step, such as mark duplicates and BQSR, when I analyzed the sequencing data of MultipSeq®?

Do you have any suggestions for dealing with MultipSeq® results?**

Thanks a lot!

Olivia Chen

From sanjeevksh on 2018-08-07

Sheila sprocha should the key value examples given for GATK hard filtering above have opposite operator signs then shown for the following flags:

MQRankSum

ReadPosRankSum

SOR

Regards,

Sanjeev

From Sheila on 2018-08-10

@OliviaChen

Hi Olivia,

>How should I use the optional step, such as mark duplicates and BQSR, when I analyzed the sequencing data of MultipSeq®? Do you have any suggestions for dealing with MultipSeq® results?

I am not familiar with MulitpSeq, but it seems it is amplicon based capture. So, yes in this case you should not mark duplicates, as all reads start and stop in same position and will be marked as duplicates. But, you should still run BQSR step.

-Sheila

From woodwordf_aa on 2018-08-11

Is there any Best Practices Workflow for gene panel germline variants calling ?

I started to use GATK not long ago. My lab is doing research on some specific inheritable diseases, so instead of whole genome/exome sequencing the most of the data is from small gene panels ( < 10 genes ). But according to the scripts introduction given at https://github.com/gatk-workflows/gatk4-germline-snps-indels the two workflows do not support gene panels

My questions are that

1, Is it correct (accurate enough) to try to call short variants on gene panel data.

2, Is there any official workflow for gene panels ? or I have to create my own custom workflow .

From Sheila on 2018-08-12

@sanjeevksh

Hi Sanjeev,

Have a look at [this article](https://software.broadinstitute.org/gatk/documentation/article?id=11069).

-Sheila

From OliviaChen on 2018-08-13

@woodwordf_aa

It is really coincidental that our researches are similar in the part of targeted gene panel. I’m also looking forward to an official workflow for gene panels.

At the same time, I want to communicate with you.

What is your gene capture technique? What’s the depth of your sequencing data?

I’ve been wondering recently whether different capture technologies require different processes to be used.

- OliviaChen

From Sheila on 2018-08-14

@woodwordf_aa

Hi,

The issue with gene panels is that the filtering step will not work properly because there will not be enough variants to create a good model in VQSR. I think you should be fine running the pre-processing steps and the joint variant calling steps (HaplotypeCaller/GenotypeGVCFs), but then you will need to use [hard filtering](https://software.broadinstitute.org/gatk/documentation/article?id=11069) instead of VQSR.

-Sheila

From sanjeevksh on 2018-08-15

@Sheila

Thanks for your help,

Sanjeev

From woodwordf_aa on 2018-08-16

@Sheila Thanks for the replying. By the way, you said “there will not be enough variants …”, so how many variants would be enough ? Is it possible to feed GATK more resource files (gatk VariantRecalibrator -resource ) to overcome this obstacle ?

OliviaChen My lab outsourced the sequencing library building to a commercial company in GuangZhou China. You can google 'SeqCares Pro Lung Cancer HotSpot Panel' to know more about the panel ( I have the panel operation manual but in Chinese ). The panel support 11 genes and 20000x coverage, but we are about to first do a trial run, so only 2000x coverage. You can contact with me by owood2student.cccs.edu

From woodwordf_aa on 2018-08-17

================UPDATE================

I think I got the the answer to the first question here. https://gatkforums.broadinstitute.org/gatk/discussion/11579/vqsr-cnn-filtering-for-small-100-gene-panels

From Sheila on 2018-08-22

@woodwordf_aa

Hi,

Glad you found one answer :smile:

> Is it possible to feed GATK more resource files (gatk VariantRecalibrator -resource ) to overcome this obstacle ?

No, it is the number of variants in your own VCF that matters. Adding extra resource files will not help.

-Sheila

From phh on 2018-08-23

Hi,

I tried to run GenotypeGVCFs after combining ~30 GVCF data of each chromosome by GenomicsDBImport. The consolidating process was completed but I got the error message as follows when running GenotypeGVCFs:

12:33:57.406 INFO NativeLibraryLoader – Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.0.7.0-local.jar!/com/intel/gkl/native/libgkl_compression.so

12:33:57.635 INFO GenotypeGVCFs – ——————————————————————————————

12:33:57.637 INFO GenotypeGVCFs – The Genome Analysis Toolkit (GATK) v4.0.7.0

12:33:57.640 INFO GenotypeGVCFs – For support and documentation go to https://software.broadinstitute.org/gatk/

12:33:57.645 INFO GenotypeGVCFs – Executing as phhsieh@nid00394 on Linux v4.4.73-5.1_4.0.141-cray_ari_c amd64

12:33:57.647 INFO GenotypeGVCFs – Java runtime: OpenJDK 64-Bit Server VM v1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11

12:33:57.648 INFO GenotypeGVCFs – Start Date/Time: August 23, 2018 12:33:57 PM PDT

12:33:57.648 INFO GenotypeGVCFs – ——————————————————————————————

12:33:57.650 INFO GenotypeGVCFs – ——————————————————————————————

12:33:57.651 INFO GenotypeGVCFs – HTSJDK Version: 2.16.0

12:33:57.652 INFO GenotypeGVCFs – Picard Version: 2.18.7

12:33:57.652 INFO GenotypeGVCFs – HTSJDK Defaults.COMPRESSION_LEVEL : 2

12:33:57.653 INFO GenotypeGVCFs – HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false

12:33:57.655 INFO GenotypeGVCFs – HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true

12:33:57.657 INFO GenotypeGVCFs – HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false

12:33:57.658 INFO GenotypeGVCFs – Deflater: IntelDeflater

12:33:57.658 INFO GenotypeGVCFs – Inflater: IntelInflater

12:33:57.659 INFO GenotypeGVCFs – GCS max retries/reopens: 20

12:33:57.660 INFO GenotypeGVCFs – Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes

12:33:57.661 INFO GenotypeGVCFs – Initializing engine

terminate called after throwing an instance of ‘std::length_error‘ what(): vector::_M_default_append

Using GATK jar /gatk/gatk-package-4.0.7.0-local.jar

Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx8g -jar /gatk/gatk-package-4.0.7.0-local.jar GenotypeGVCFs -R ref_genome.fasta -V gendb://my_database_chr2 -O output.chr2.vcf.gz

By googling it, it seems someone also reported something similar recently (https://github.com/broadinstitute/gatk/issues/5024) and I’m wondering if I should just tried different versions of GATK? The version of GATK used is v4.0.7.0. Thanks.

From Sheila on 2018-09-04

@phh

Hi,

It looks like that issue is still open and not yet fixed. Can you post this issue in that ticket? The developers may have some better insight.

Thanks,

Sheila

From phh on 2018-10-03

@Sheila

I think the latest version v4.0.9.0 fix the issue. GenotypeGVCFs now works fine after CombineGVCFs. Thanks.

From shlee on 2018-10-03

Great to hear updating to the latest version of GATK solved your issue @phh. Sheila’s moved on to [greener pastures](https://software.broadinstitute.org/gatk/blog?id=12887) fyi, in case you were hoping to hear from her.

From oneillkza on 2018-12-11

What is the recommended best practice for running a single genome?

This workflow ( [and the ones on GitHub](https://github.com/gatk-workflows/gatk4-germline-snps-indels “https://github.com/gatk-workflows/gatk4-germline-snps-indels”) ) seem to imply generating GVCFs, running the step to combine GVCFs, and running joint genotyping. This seems like a lot of unnecessary overhead. Would it be better to hack the WDL to do something like HaplotypeCaller -> VQSR (SNPs) -> VQSR (indels) -> applyVQSR ?

From Katie on 2019-03-29

Hi,

I am also wondering what are the recommended best practices for calling on a single genome. After HaplotypeCaller and generating an all-sites gVCF, is it still recommended by GATK to run GVCFs?

Thank you!

From Begali on 2019-04-16

Sheila

Geraldine_VdAuwera

according to this workflow https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145

Should I apply also refine genotype (samples are not from same family-trio) , and Evaluate call sets after filtering with VQSR…. I do not get the propose of Evaluate call set at last step after annotation genes with such as (Clinvar.vcf or Cancer.bed) …

Thanks for your attention in advance :) …

From omw90 on 2019-05-02

Hi Sheila Geraldine_VdAuwera

Similar to oneillkza and Katie, I was also hoping someone might be able to comment on recommendations for running the pipeline on a single sample to call germline variants. The current description of the pipeline seems to suggest it is only amenable to calling variants in multiple samples (at least the way I read it). After running HaplotypeCaller, would one then proceed directly to perform a hard filter on the genotype calls?

I was also hoping someone might be able to comment on when it is appropriate (and not appropriate) to perform the joint variant calling step on a set of samples. For example, should the data for these samples all have been generated using the same assay, etc.

Thanks in advance for your help,

From Haseung on 2019-10-01

Hi,

I’ve tried joint genotyping process with ‘GenotypeGVCFs’ tool in GATK v4.1.2 and the output has generated without any errors. However, there is something weird that I couldn’t understand. In the vcf file, there exists two types of GT separator, ‘|’ and ‘/’.

```

1 38 . C CA 10745.24 . AC=92;AF=0.920;AN=100;DP=210;ExcessHet=0.0000;FS=0.000;InbreedingCoeff=0.4128;MLEAC=106;MLEAF=1.00;MQ=36.67;QD=25.36;SOR=10.626 GT:AD:DP:GQ:PGT:PID:PL:PS 1|1:0,19:19:69:1|1:38_C_CA:996,69,0:38 1|1:0,1:1:6:1|1:38_C_CA:81,6,0:38 1|1:0,24:24:75:1|1:38_C_CA:1110,75,0:38 1|1:0,3:3:9:1|1:38_C_CA:135,9,0:38 1|1:0,2:2:6:1|1:38_C_CA:90,6,0:38 1|1:0,19:19:63:1|1:38_C_CA:926,63,0:38 1|1:0,2:2:6:1|1:38_C_CA:90,6,0:38 1|1:0,5:5:18:1|1:38_C_CA:261,18,0:38 1|1:0,4:4:12:1|1:38_C_CA:180,12,0:38 1|1:0,2:2:9:1|1:38_C_CA:126,9,0:38 1|1:0,4:4:12:1|1:38_C_CA:180,12,0:38 1|1:0,4:4:12:1|1:38_C_CA:180,12,0:38 1|1:0,4:4:12:1|1:38_C_CA:180,12,0:38 1|1:0,3:3:12:1|1:38_C_CA:171,12,0:38 1|1:0,1:1:6:1|1:38_C_CA:90,6,0:38 1|1:0,4:4:12:1|1:38_C_CA:180,12,0:38 0/0:1,0:1:3:.:.:0,3,28 1|1:0,6:6:18:1|1:38_C_CA:270,18,0:38 ./.:2,0:2:.:.:.:0,0,./.:0,0:0:.:.:.:0,0,0 1|1:0,4:4:18:1|1:38_C_CA:261,18,0:38 ./.:0,0:0:.:.:.:0,0,0 1|1:0,2:2:6:1|1:38_C_CA:90,6,0:38 1|1:0,2:2:6:1|1:38_C_CA:90,6,0:38 ./.:0,0:0:.:.:.:0,0,1|1:0,3:3:9:1|1:38_C_CA:135,9,0:38 1|1:0,3:3:12:1|1:38_C_CA:171,12,0:38 1|1:0,5:5:15:1|1:38_C_CA:225,15,0:38 1|1:0,1:1:6:1|1:38_C_CA:90,6,0:38 ./.:0,0:0:.:.:.:0,0,0 1|1:0,3:3:9:1|1:38_C_CA:135,9,0:38 1|1:0,5:5:15:1|1:38_C_CA:225,15,0:38 1|1:0,2:2:6:1|1:38_C_CA:90,6,0:38 1|1:0,2:2:6:1|1:38_C_CA:90,6,0:38 ./.:2,0:2:.:.:.:0,0,0 1|1:0,3:3:9:1|1:38_C_CA:135,9,0:38 0/0:1,0:1:3:.:.:0,3,34 ./.:0,0:0:.:.:.:0,0,0 0/0:1,0:1:3:.:.:0,3,35 1|1:0,5:5:15:1|1:38_C_CA:225,15,0:38 1|1:0,5:5:18:1|1:38_C_CA:270,18,0:38 1|1:0,6:6:18:1|1:38_C_CA:270,18,0:38 1|1:0,3:3:9:1|1:38_C_CA:135,9,0:38 1|1:0,2:2:6:1|1:38_C_CA:90,6,0:38 1|1:0,4:4:12:1|1:38_C_CA:180,12,0:38 1|1:0,2:2:6:1|1:38_C_CA:90,6,0:38 1|1:0,3:3:9:1|1:38_C_CA:135,9,0:38 1|1:0,3:3:9:1|1:38_C_CA:135,9,0:38 0/0:1,0:1:3:.:.:0,3,34 1|1:0,2:2:6:1|1:38_C_CA:90,6,0:38 1|1:0,3:3:12:1|1:38_C_CA:171,12,0:38 1|1:0,3:3:9:1|1:38_C_CA:135,9,0:38 1|1:0,5:5:21:1|1:38_C_CA:306,21,0:38 1|1:0,1:1:6:1|1:38_C_CA:90,6,0:38 1|1:0,4:4:12:1|1:38_C_CA:180,12,0:38 1|1:0,5:5:15:1|1:38_C_CA:225,15,0:38 1|1:0,4:4:12:1|1:38_C_CA:180,12,0:38

```

I couldn’t find any options that correct this situation in the tool documentations.

Looking forward to your reply. Thank you!

From YingLiu on 2019-12-06

Hi , @Geraldine_VdAuwera could you tell what tools you use to draw this workflow ? it looks nice …![](https://us.v-cdn.net/5019796/uploads/editor/q3/jvt5o5137rha.png “”)

it’s drawed by Visio ?

From Geraldine_VdAuwera on 2019-12-06

Hi @YingLiu, I used Adobe Illustrator for this image.

From YingLiu on 2019-12-09

@Geraldine_VdAuwera thank you ! it looks so nice

From Geraldine_VdAuwera on 2019-12-09

Thank you for your kind words, @YingLiu

Report abuse