Intervals and interval lists

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by GATK_Team

on 2017-12-24

Interval lists define subsets of genomic regions, sometimes even just individual positions in the genome. You can provide GATK tools with intervals or lists of intervals when you want to restrict them to operating on a subset of genomic regions. There are four main types of reasons for doing so:

You want to run a quick test on a subset of data (often used in troubleshooting)
You want to parallelize execution of an analysis across genomic regions
You need to exclude regions that have bad or uninformative data where a tool is getting stuck
The analysis you're running should only take data from those subsets due to how the underlying algorithm works

Regarding the latter case, see the Best Practices workflow recommendations and tool example commands for guidance regarding when to restrict analysis to intervals.

Interval-related arguments and syntax

Arguments for specifying and modifying intervals are provided by the engine and can be applied to most of not all tools. The main arguments you need to know about are the following:

-L / --intervals allows you to specify an interval or list of intervals to include.
-XL / --exclude-intervals allows you to specify an interval or list of intervals to exclude.
-ip / --interval-padding allows you to add padding (in bp) to the intervals you include.
-ixp / --interval-exclusion-padding allows you to add padding (in bp) to the intervals you exclude.

By default the engine will merge any intervals that abut (i.e. they are contiguous, they touch without overlapping) or overlap into a single interval. This behavior can be modified by specifying an alternate interval merging rule (see --interval-merging-rule in the Tool Docs).

The syntax for using -L is as follows; it applies equally to -XL:

-L chr20 for contig chr20.
-L chr20:1-100 for contig chr20, positions 1-100.
-L intervals.list (or intervals.interval_list, or intervals.bed) when specifying a text file containing intervals (see supported formats below).
-L variants.vcf when specifying a VCF file containing variant records; their genomic coordinates will be used as intervals.

If you want to provide several intervals or several interval lists, just pass them in using separate -L or -XL arguments (you can even use both of them in the same command). You can use all the different formats within the same command line. By default, the GATK engine will take the UNION of all the intervals in all the sets. This behavior can be modified by specifying an alternate interval set rule (see --interval-set-rule in the Tool Docs).

Supported interval list formats

GATK supports several types of interval list formats: Picard-style .interval_list, GATK-style .list, BED files with extension .bed, and VCF files. The intervals MUST be sorted by coordinate (in increasing order) within contigs; and the contigs must be sorted in the same order as in the sequence dictionary. This is require for efficiency reasons.

A. Picard-style .interval_list

Picard-style interval files have a SAM-like header that includes a sequence dictionary. The intervals are given in the form <chr> <start> <stop> + <target_name>, with fields separated by tabs, and the coordinates are 1-based (first position in the genome is position 1, not position 0).

@HD VN:1.0 SO:coordinate @SQ SN:1 LN:249250621 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:1b22b98cdeb4a9304cb5d48026a85128 SP:Homo Sapiens @SQ SN:2 LN:243199373 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:a0d9851da00400dec1098a9255ac712e SP:Homo Sapiens 1 30366 30503 + target_1 1 69089 70010 + target_2 1 367657 368599 + target_3 1 621094 622036 + target_4 1 861320 861395 + target_5 1 865533 865718 + target_6

This is the preferred format because the explicit sequence dictionary safeguards against accidental misuse (e.g. apply hg18 intervals to an hg19 BAM file). Note that this file is 1-based, not 0-based (the first position in the genome is position 1).

B. GATK-style .list or .intervals

This is a simpler format, where intervals are in the form <chr>:<start>-<stop>, and no sequence dictionary is necessary. This file format also uses 1-based coordinates. Note that only the <chr> part is strictly required; if you just want to specify chromosomes/ contigs as opposed to specific coordinate ranges, you don't need to specify the rest. Both <chr>:<start>-<stop> and <chr> can be present in the same file. You can also specify intervals in this format directly at the command line instead of writing them in a file.

C. BED files with extension .bed

We also accept the widely-used BED format, where intervals are in the form <chr> <start> <stop>, with fields separated by tabs. However, you should be aware that this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats (e.g. if you're cooking up a custom interval list derived from a file in a 1-based format) should be offset by 1. The GATK engine recognizes the .bed extension and interprets the coordinate system accordingly.

D. VCF files

Yeah, I bet you didn't expect that was a thing! It's very convenient. Say you want to redo a variant calling run on a set of variant calls that you were given by a colleague, but with the latest version of HaplotypeCaller. You just provide the VCF, slap on some padding on the fly using e.g. -ip 100 in the HC command, and boom, done. Each record in the VCF will be interpreted as a single-base interval, and by adding padding you ensure that the caller sees enough context to reevaluate the call appropriately.

Obtaining suitable interval lists

So where do those intervals come from? It depends a lot on what you're working with (everyone's least favorite answer, I know). The most important distinction is the sequencing experiment type: is it whole genome, or targeted sequencing of some sort?

Targeted sequencing (exomes, gene panels etc.)

For exomes and similarly targeted data types, the interval list should correspond to the capture targets used for the library prep, and is typically provided by the prep kit manufacturer (with versions for each ref genome build of course).

We make our exome interval lists available, but be aware that they are specific to the custom exome targeting kits used at the Broad. If you got your sequencing done somewhere else, you should seek to get the appropriate intervals list from the sequencing provider.

Whole genomes (WGS)

For whole genome sequence, the intervals lists don’t depend on the prep (since in principle you captured the “whole genome”) so instead it depends on what regions of the genome you want to blacklist (e.g. centromeric regions that waste your time for nothing) and how the reference genome build enables you to cut up regions (separated by Ns) for scatter-gather parallelizing.

We make our WGS interval lists available, and the good news is that, as long as you're using the same genome reference build as us, you can use them with your own data even if it comes from somewhere else -- assuming you agree with our decisions about which regions to blacklist! Which you can examine by looking at the intervals themselves. However, we don't currently have documentation on their provenance, sorry -- baby steps.

Updated on 2017-12-24

From ehodis on 2018-02-22

Is there a Google bucket where I can find an hg19 WGS interval list file for use with “help-gatk/Somatic-CNVs-GATK4”?

From Sheila on 2018-02-28

@ehodis

Hi,

Unfortunately, we do not provide one for hg19.

-Sheila

From hexy on 2018-04-08

Hi @ehodis ,

I want hg19 WGS interval list file also, have you found that?

From registered_user on 2018-09-07

A link to where these list can be found would be helpful! Or better yet, a tutorial on how to create an interval list from a reference genome.

From Sheila on 2018-09-13

@registered_user

Hi,

We don’t provide hg19 intervals. I don’t know about a tutorial on how to make your own interval list, but it may come in the future.

-Sheila

From slee on 2018-09-18

Hi ehodis, ehodis, and @registered_user,

For running the Somatic-CNVs-GATK4 workflow on WGS data, you can start by simply passing a .list or .intervals file containing all of the autosomes (1, 2, 3, etc.) This will run the CNV workflow on the autosomes, covering them with bins of size `bin_length`. For reasons discussed in the workflow and tool documentation, extra care must be taken to run the workflow on the sex chromosomes.

For running this workflow on WES data, we do not provide intervals—-you should use whatever interval list is appropriate for the capture kit used to generate your data.

From pkhadka on 2018-09-20

Hi, it says here that you make your WGS interval list available. Can you point to me where I might be able to find this? I am using hg18 data.

From alongalor on 2018-10-22

I am using hg19 data and would find a WGS interval list helpful too…

From kclem on 2018-12-29

Bump this. Can you add the link to the interval lists for download? For hg38?

From sen_guo1 on 2019-01-30

above are the links that can get the interval_list file, while when use thefile < somatic-b37_whole_exome_agilent_1.1_refseq_plus_3_boosters.Homo_sapiens_assembly19.baits.interval_list > , GATK says

Homo_sapiens_assembly19.targets.interval_list is malformed:Interval file could not be parsed in any supported format

Homo_sapiens_assembly19.targets.interval_list has an invalid interval : 1:30366-30503 + target_1

Any advice would be great appreciate!

From sen_guo1 on 2019-01-30

console.cloud.google.com/storage/browser/gatk-best-practices/somatic-b37/?project=broad-dsde-outreach

console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38?project=broad-dsde-outreach

From Tiffany_at_Broad on 2019-02-18

Here is a link to the [gatk-test-data bucket with interval lists for b37 and hg38](https://console.cloud.google.com/storage/browser/gatk-test-data/intervals/?project=broad-dsde-outreach&organizationId=548622027621 “gatk-test-data bucket with interval lists for b37 and hg38”)

From pcusco on 2019-02-26

Hi,

I am getting a Java NullPointerException when I try to use .bed file with the —interval-list/-L option.

This doesn’t happen at chromosome boundaries or the like. And doesn’t happen if I take only the first ~20k lines of the bed.

Also, I don’t get any error if I use a different, smaller bed file with the same chromosome order as the problematic file, so I ruled out this as the source of the error.

Is there a maximum number of lines or bp? Anyone knows why is this happening?

Thanks!

From RFRichholt on 2019-03-12

Can you describe the “inclusiveness” of these intervals better. For example, to represent a single base interval at chr1 position 100 would it be `chr1:100-100` or `chr1:100-101`?

From Rosmaninho on 2019-03-28

> @Tiffany_at_Broad said:

> Here is a link to the [gatk-test-data bucket with interval lists for b37 and hg38](https://console.cloud.google.com/storage/browser/gatk-test-data/intervals/?project=broad-dsde-outreach&organizationId=548622027621 “gatk-test-data bucket with interval lists for b37 and hg38”)

I suppose the recommended file is b37_wgs_consolidated_calling_intervals.list ?

Does this exclude highly repetitive regions, it’s not present on the resource bundle so it would be nice to have some more info about it…

From Rosmaninho on 2019-04-02

While running HaplotypeCaller by defining a while loop pointing to the Broad file of interval lists I got the following error on -L.

broad_interval_list=$interval_file/intervals_b37_wgs_consolidated_calling_intervals.list while IFS=’‘ read -r line || [[ -n “$line” ]]; do echo “Text read from file: $line” | \ ls -d $bam_files_bqsr/${names[${SLURM_ARRAY_TASK_ID}]}.bam | srun shifter —image=broadinstitute/gatk:latest gatk —java-options ‘-Xmx32G’ HaplotypeCaller -I=$bam_files_bqsr/${names[${SLURM_ARRAY_TASK_ID}]}.bam \ -O=$output/${names[${SLURM_ARRAY_TASK_ID}]}_${line}_50_wgs.g.vcf.gz \ -ERC=GVCF -L=${line} -ip=50 -R=$genome/human_g1k_v37.fasta —native-pair-hmm-threads=$SLURM_CPUS_PER_TASK ; done < “$broad_interval_list”

``A USER ERROR has occurred: No value found for tagged argument: L=1:10001-29878082

Also, is it okay to include a string 1:10001-29878082 in the name of the output file? I find having : in a file name not be a good practice.

From Rosmaninho on 2019-04-04

Well, just solved this minor niggle, no worries.

From bhanuGandham on 2019-04-08

@Rosmaninho

b37_wgs_consolidated_calling_intervals.list is a list of the intervals used for WGS short variant calling (as seen in the [$5G pipeline](https://github.com/gatk-workflows/five-dollar-genome-analysis-pipeline “$5G pipeline”)), which are available as this “consolidated” list (ie in a single file) and as a scattered list of lists. And this is the b37 reference version.

From Rosmaninho on 2019-04-09

Thank you for the reply Bhanu!!

Is there more information on the regions removed from this list? Repetitive regions, telomeres, etc?…

I used it to treat my WGS data instead of using the full chromossomes and it worked quite well (created a list of lists from it). :)

From javis on 2019-07-26

could I use the interval wgs_calling_regions.hg38.interval_list to filter bad regions with API SelectVariants?

`gatk SelectVariants -L wgs_calling_regions.hg38.interval_list`

wgs_calling_regions.hg38.interval_list download from https://console.cloud.google.com/storage/browser/_details/gatk-test-data/intervals/wgs_calling_regions.hg38.interval_list

Report abuse