When should I restrict my analysis to specific intervals?

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019.

Created by GATK_Team on 2017-12-28

This document covers the reasoning behind the use of genomic intervals. If you're looking for instructions on how to use intervals in practice, including argument details and supported formats, please see this doc.

Depending on what you're trying to do, there are many reasons why you might want to tell a tool to operate on a subset of genomic regions only. We distinguish four main types of reasons for doing so:

- You want to run a quick test on a subset of data (often used in troubleshooting)
- You want to parallelize execution of an analysis across genomic regions
- You need to exclude regions that have bad or uninformative data where a tool is getting stuck
- The analysis you're running should only take data from those subsets due to how the underlying algorithm works

The first three should be fairly self-explanatory, but let's go into a bit more detail on the fourth one.

In a nutshell

- Whole genome analysis: Intervals are not required, but they can help speed up analysis by eliminating "difficult" regions and enabling parallelism.

- Exome analysis and other targeted sequencing: You must provide the list of targets, with padding, to exclude off-target noise. This will also speed up analysis and enable parallelism.

Whole genome analysis

It is not strictly necessary to restrict analysis to intervals when working with whole genomes, since presumably you're interested in all of it. However, from a technical perspective, you may want to mask out certain contigs (e.g. chrY or non-chromosome contigs) or regions (e.g. centromeres) where you know the data is unreliable or very messy and causes excessive slowdowns. In addition, defining whole-genome intervals allows you to parallelize execution across intervals using the scatter-gather mode of parallelism.
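As a rough illustration of the masking and scatter-gather idea described above, the following sketch tiles a genome with fixed-size windows, skips excluded contigs, and distributes the windows across shards that could each be processed independently. It is not how GATK or any pipeline actually builds its interval lists; the function names, the toy contig lengths, the window size, and the round-robin assignment are all assumptions made for this example.

```python
from typing import Dict, Iterable, List, Tuple

# (contig, start, end) with 1-based, inclusive coordinates
Interval = Tuple[str, int, int]


def make_scatter_intervals(
    contig_lengths: Dict[str, int],
    excluded_contigs: Iterable[str] = (),
    chunk_size: int = 1_000_000,
) -> List[Interval]:
    """Tile every non-excluded contig with fixed-size, non-overlapping windows."""
    excluded = set(excluded_contigs)
    intervals: List[Interval] = []
    for contig, length in contig_lengths.items():
        if contig in excluded:
            continue  # masked out, e.g. a contig known to be messy
        for start in range(1, length + 1, chunk_size):
            end = min(start + chunk_size - 1, length)
            intervals.append((contig, start, end))
    return intervals


def scatter(intervals: List[Interval], n_shards: int) -> List[List[Interval]]:
    """Distribute intervals round-robin into n_shards independent work lists."""
    shards: List[List[Interval]] = [[] for _ in range(n_shards)]
    for i, interval in enumerate(intervals):
        shards[i % n_shards].append(interval)
    return shards


def to_string(interval: Interval) -> str:
    """Render an interval as contig:start-end."""
    contig, start, end = interval
    return f"{contig}:{start}-{end}"


# Hypothetical toy contig lengths, not real chromosome sizes.
toy_lengths = {"chr1": 2_500_000, "chr2": 1_200_000, "chrY": 900_000}
windows = make_scatter_intervals(toy_lengths, excluded_contigs=["chrY"])
shards = scatter(windows, n_shards=2)
for shard_index, shard in enumerate(shards):
    print(shard_index, [to_string(interval) for interval in shard])
```

Each shard can then be handed to a separate job, and the per-shard outputs gathered afterwards. Fixed-size windows are the simplest choice here; a production interval list would more likely be cut at biologically sensible boundaries.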

We share the lists of "good" whole-genome intervals that we use in our production pipelines for human analysis in our resource bundle (see Download page).

Exome analysis and other targeted sequencing

By definition, exome sequencing and other targeted sequencing data don’t cover the entire genome, so most analyses can be restricted to just the capture targets (genes or exons) to save processing time and enable scatter-gather parallelism. In addition, some processing steps, such as base quality score recalibration (BQSR), should be restricted to the capture targets in order to eliminate off-target sequencing data, which is uninformative and a source of noise.

You should use the list of target intervals that corresponds to the library preparation method that was used to generate the data. If you're working with exome sequencing data that was prepared by someone else, you'll need to find out what kit was used; the kit manufacturers typically provide the lists of intervals that correspond to their kits on their websites. We cannot provide you with a suitable interval list unless you are sure that your data was sequenced at the Broad.

Important notes:

Whatever you end up using intervals for, keep this in mind: for tools that output a BAM or VCF file, the output file will only contain data from the intervals you specified. Any data that falls outside these intervals will be lost to downstream analysis.
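To make the consequence above concrete, here is a minimal sketch of what restricting output to intervals means for downstream analysis. It treats each record as a single position, which is a simplification (reads and multi-base variants span a range), and the record tuples, interval values, and function names are hypothetical rather than anything GATK exposes.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 1-based inclusive
Record = Tuple[str, int]         # (contig, position), a simplified stand-in for a call


def falls_in_any(record: Record, intervals: List[Interval]) -> bool:
    """True if the record's position lies inside at least one interval."""
    contig, pos = record
    return any(c == contig and start <= pos <= end for c, start, end in intervals)


def restrict_to_intervals(records: List[Record], intervals: List[Interval]) -> List[Record]:
    """Keep only records inside the intervals; everything else is discarded."""
    return [record for record in records if falls_in_any(record, intervals)]


# Hypothetical targets and calls, for illustration only.
targets = [("chr1", 1000, 2000), ("chr2", 500, 800)]
calls = [("chr1", 999), ("chr1", 1000), ("chr1", 1500), ("chr2", 900), ("chr3", 10)]

kept = restrict_to_intervals(calls, targets)
dropped = [record for record in calls if record not in kept]
print("kept:", kept)
print("dropped:", dropped)
```

Note that the call at chr1:999 is dropped even though it sits one base outside a target, which is one reason padding the intervals is worth considering.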

In general, we recommend adding some padding to the intervals in order to include the flanking regions (typically ~100 bp). There is no need to modify your target list; you can have the GATK engine do it for you automatically using the interval padding argument. Padding is not required, but if you do use it, you should apply it consistently at all steps where you use a list of intervals.
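The arithmetic behind padding can be sketched as follows: extend each interval by a fixed number of bases on both sides, clamp to the contig boundaries, and merge any intervals that now overlap or touch. This only illustrates the concept and is not the GATK engine's implementation; the helper names, the toy coordinates, and the decision to merge adjacent intervals are assumptions of this example.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 1-based inclusive


def pad_intervals(
    intervals: List[Interval], padding: int, contig_lengths: Dict[str, int]
) -> List[Interval]:
    """Extend each interval by `padding` bases per side, clamped to the contig."""
    padded: List[Interval] = []
    for contig, start, end in intervals:
        new_start = max(1, start - padding)
        new_end = min(contig_lengths[contig], end + padding)
        padded.append((contig, new_start, new_end))
    return padded


def merge_overlapping(intervals: List[Interval]) -> List[Interval]:
    """Merge intervals on the same contig that overlap or are directly adjacent."""
    merged: List[Interval] = []
    for contig, start, end in sorted(intervals):
        if merged and merged[-1][0] == contig and start <= merged[-1][2] + 1:
            prev_contig, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_contig, prev_start, max(prev_end, end))
        else:
            merged.append((contig, start, end))
    return merged


# Hypothetical targets on a toy 10 kb contig.
toy_lengths = {"chr1": 10_000}
targets = [("chr1", 50, 200), ("chr1", 350, 500), ("chr1", 9_950, 10_000)]

padded = pad_intervals(targets, padding=100, contig_lengths=toy_lengths)
final = merge_overlapping(padded)
print(padded)
print(final)
```

The first two targets end up as one interval after padding, so the set of regions a tool sees can differ noticeably from the raw target list. This is why the same padding should be used at every step that takes the interval list.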

You will have noticed by now that this article does not provide detailed guidelines on which tools should or should not use an interval list. For tool-by-tool recommendations, please see the example commands in the individual tool docs; they show the most common recommended usage for each. See also the Best Practices documentation for up-to-date implementation notes.

From MehulS on 2019-01-22

I’ve subsetted 80-90 low-coverage Illumina paired-end WGS samples into intervals containing around 100-200 genes. Is it advisable to run VQSR on these samples? What is the minimum number of variants needed for VQSR to run? I aim to discover common variants in my population within specific targets.

Using GATK 4.0.12