created by GATK_Team
on 2017-12-24
Interval lists define subsets of genomic regions, sometimes even just individual positions in the genome. You can provide GATK tools with intervals or lists of intervals when you want to restrict them to operating on a subset of genomic regions. There are four main types of reasons for doing so:
Regarding the latter case, see the Best Practices workflow recommendations and tool example commands for guidance regarding when to restrict analysis to intervals.
Arguments for specifying and modifying intervals are provided by the engine and can be applied to most of not all tools. The main arguments you need to know about are the following:
-L
/ --intervals
allows you to specify an interval or list of intervals to include.-XL
/ --exclude-intervals
allows you to specify an interval or list of intervals to exclude.-ip
/ --interval-padding
allows you to add padding (in bp) to the intervals you include.-ixp
/ --interval-exclusion-padding
allows you to add padding (in bp) to the intervals you exclude.By default the engine will merge any intervals that abut (i.e. they are contiguous, they touch without overlapping) or overlap into a single interval. This behavior can be modified by specifying an alternate interval merging rule (see --interval-merging-rule
in the Tool Docs).
The syntax for using -L
is as follows; it applies equally to -XL
:
-L chr20
for contig chr20.-L chr20:1-100
for contig chr20, positions 1-100.-L intervals.list
(or intervals.interval_list
, or intervals.bed
) when specifying a text file containing intervals (see supported formats below).-L variants.vcf
when specifying a VCF file containing variant records; their genomic coordinates will be used as intervals.If you want to provide several intervals or several interval lists, just pass them in using separate -L
or -XL
arguments (you can even use both of them in the same command). You can use all the different formats within the same command line. By default, the GATK engine will take the UNION of all the intervals in all the sets. This behavior can be modified by specifying an alternate interval set rule (see --interval-set-rule
in the Tool Docs).
GATK supports several types of interval list formats: Picard-style .interval_list
, GATK-style .list
, BED files with extension .bed
, and VCF files. The intervals MUST be sorted by coordinate (in increasing order) within contigs; and the contigs must be sorted in the same order as in the sequence dictionary. This is require for efficiency reasons.
Picard-style interval files have a SAM-like header that includes a sequence dictionary. The intervals are given in the form <chr> <start> <stop> + <target_name>
, with fields separated by tabs, and the coordinates are 1-based (first position in the genome is position 1, not position 0).
@HD VN:1.0 SO:coordinate @SQ SN:1 LN:249250621 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:1b22b98cdeb4a9304cb5d48026a85128 SP:Homo Sapiens @SQ SN:2 LN:243199373 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:a0d9851da00400dec1098a9255ac712e SP:Homo Sapiens 1 30366 30503 + target_1 1 69089 70010 + target_2 1 367657 368599 + target_3 1 621094 622036 + target_4 1 861320 861395 + target_5 1 865533 865718 + target_6
This is the preferred format because the explicit sequence dictionary safeguards against accidental misuse (e.g. apply hg18 intervals to an hg19 BAM file). Note that this file is 1-based, not 0-based (the first position in the genome is position 1).
This is a simpler format, where intervals are in the form <chr>:<start>-<stop>
, and no sequence dictionary is necessary. This file format also uses 1-based coordinates. Note that only the <chr>
part is strictly required; if you just want to specify chromosomes/ contigs as opposed to specific coordinate ranges, you don't need to specify the rest. Both <chr>:<start>-<stop>
and <chr>
can be present in the same file. You can also specify intervals in this format directly at the command line instead of writing them in a file.
We also accept the widely-used BED format, where intervals are in the form <chr> <start> <stop>
, with fields separated by tabs. However, you should be aware that this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats (e.g. if you're cooking up a custom interval list derived from a file in a 1-based format) should be offset by 1. The GATK engine recognizes the .bed
extension and interprets the coordinate system accordingly.
Yeah, I bet you didn't expect that was a thing! It's very convenient. Say you want to redo a variant calling run on a set of variant calls that you were given by a colleague, but with the latest version of HaplotypeCaller. You just provide the VCF, slap on some padding on the fly using e.g. -ip 100
in the HC command, and boom, done. Each record in the VCF will be interpreted as a single-base interval, and by adding padding you ensure that the caller sees enough context to reevaluate the call appropriately.
So where do those intervals come from? It depends a lot on what you're working with (everyone's least favorite answer, I know). The most important distinction is the sequencing experiment type: is it whole genome, or targeted sequencing of some sort?
For exomes and similarly targeted data types, the interval list should correspond to the capture targets used for the library prep, and is typically provided by the prep kit manufacturer (with versions for each ref genome build of course).
We make our exome interval lists available, but be aware that they are specific to the custom exome targeting kits used at the Broad. If you got your sequencing done somewhere else, you should seek to get the appropriate intervals list from the sequencing provider.
For whole genome sequence, the intervals lists don’t depend on the prep (since in principle you captured the “whole genome”) so instead it depends on what regions of the genome you want to blacklist (e.g. centromeric regions that waste your time for nothing) and how the reference genome build enables you to cut up regions (separated by Ns) for scatter-gather parallelizing.
We make our WGS interval lists available, and the good news is that, as long as you're using the same genome reference build as us, you can use them with your own data even if it comes from somewhere else -- assuming you agree with our decisions about which regions to blacklist! Which you can examine by looking at the intervals themselves. However, we don't currently have documentation on their provenance, sorry -- baby steps.
Updated on 2017-12-24
From ehodis on 2018-02-22
Is there a Google bucket where I can find an hg19 WGS interval list file for use with “help-gatk/Somatic-CNVs-GATK4”?
From Sheila on 2018-02-28
@ehodis
Hi,
Unfortunately, we do not provide one for hg19.
-Sheila
From hexy on 2018-04-08
Hi @ehodis ,
I want hg19 WGS interval list file also, have you found that?
From registered_user on 2018-09-07
A link to where these list can be found would be helpful! Or better yet, a tutorial on how to create an interval list from a reference genome.
From Sheila on 2018-09-13
@registered_user
Hi,
We don’t provide hg19 intervals. I don’t know about a tutorial on how to make your own interval list, but it may come in the future.
-Sheila
From slee on 2018-09-18
Hi ehodis,
ehodis, and @registered_user,
For running the Somatic-CNVs-GATK4 workflow on WGS data, you can start by simply passing a .list or .intervals file containing all of the autosomes (1, 2, 3, etc.) This will run the CNV workflow on the autosomes, covering them with bins of size `bin_length`. For reasons discussed in the workflow and tool documentation, extra care must be taken to run the workflow on the sex chromosomes.
For running this workflow on WES data, we do not provide intervals—-you should use whatever interval list is appropriate for the capture kit used to generate your data.
From pkhadka on 2018-09-20
Hi, it says here that you make your WGS interval list available. Can you point to me where I might be able to find this? I am using hg18 data.
From alongalor on 2018-10-22
I am using hg19 data and would find a WGS interval list helpful too…
From kclem on 2018-12-29
Bump this. Can you add the link to the interval lists for download? For hg38?
From sen_guo1 on 2019-01-30
above are the links that can get the interval_list file, while when use thefile < somatic-b37_whole_exome_agilent_1.1_refseq_plus_3_boosters.Homo_sapiens_assembly19.baits.interval_list > , GATK says
Homo_sapiens_assembly19.targets.interval_list is malformed:Interval file could not be parsed in any supported format
Homo_sapiens_assembly19.targets.interval_list has an invalid interval : 1:30366-30503 + target_1
Any advice would be great appreciate!
From sen_guo1 on 2019-01-30
console.cloud.google.com/storage/browser/gatk-best-practices/somatic-b37/?project=broad-dsde-outreach
console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38?project=broad-dsde-outreach
From Tiffany_at_Broad on 2019-02-18
Here is a link to the [gatk-test-data bucket with interval lists for b37 and hg38](https://console.cloud.google.com/storage/browser/gatk-test-data/intervals/?project=broad-dsde-outreach&organizationId=548622027621 “gatk-test-data bucket with interval lists for b37 and hg38”)
From pcusco on 2019-02-26
Hi,
I am getting a Java NullPointerException when I try to use .bed file with the —interval-list/-L option.
This doesn’t happen at chromosome boundaries or the like. And doesn’t happen if I take only the first ~20k lines of the bed.
Also, I don’t get any error if I use a different, smaller bed file with the same chromosome order as the problematic file, so I ruled out this as the source of the error.
Is there a maximum number of lines or bp? Anyone knows why is this happening?
Thanks!
From RFRichholt on 2019-03-12
Can you describe the “inclusiveness” of these intervals better. For example, to represent a single base interval at chr1 position 100 would it be `chr1:100-100` or `chr1:100-101`?
From Rosmaninho on 2019-03-28
> @Tiffany_at_Broad said:
> Here is a link to the [gatk-test-data bucket with interval lists for b37 and hg38](https://console.cloud.google.com/storage/browser/gatk-test-data/intervals/?project=broad-dsde-outreach&organizationId=548622027621 “gatk-test-data bucket with interval lists for b37 and hg38”)
I suppose the recommended file is b37_wgs_consolidated_calling_intervals.list ?
Does this exclude highly repetitive regions, it’s not present on the resource bundle so it would be nice to have some more info about it…
From Rosmaninho on 2019-04-02
While running HaplotypeCaller by defining a while loop pointing to the Broad file of interval lists I got the following error on -L.
broad_interval_list=$interval_file/intervals_b37_wgs_consolidated_calling_intervals.list while IFS=’‘ read -r line || [[ -n “$line” ]]; do echo “Text read from file: $line” | \ ls -d $bam_files_bqsr/${names[${SLURM_ARRAY_TASK_ID}]}.bam | srun shifter —image=broadinstitute/gatk:latest gatk —java-options ‘-Xmx32G’ HaplotypeCaller -I=$bam_files_bqsr/${names[${SLURM_ARRAY_TASK_ID}]}.bam \ -O=$output/${names[${SLURM_ARRAY_TASK_ID}]}_${line}_50_wgs.g.vcf.gz \ -ERC=GVCF -L=${line} -ip=50 -R=$genome/human_g1k_v37.fasta —native-pair-hmm-threads=$SLURM_CPUS_PER_TASK ; done < “$broad_interval_list”
``A USER ERROR has occurred: No value found for tagged argument: L=1:10001-29878082
Also, is it okay to include a string 1:10001-29878082 in the name of the output file? I find having : in a file name not be a good practice.
From Rosmaninho on 2019-04-04
Well, just solved this minor niggle, no worries.
From bhanuGandham on 2019-04-08
@Rosmaninho
b37_wgs_consolidated_calling_intervals.list is a list of the intervals used for WGS short variant calling (as seen in the [$5G pipeline](https://github.com/gatk-workflows/five-dollar-genome-analysis-pipeline “$5G pipeline”)), which are available as this “consolidated” list (ie in a single file) and as a scattered list of lists. And this is the b37 reference version.
From Rosmaninho on 2019-04-09
Thank you for the reply Bhanu!!
Is there more information on the regions removed from this list? Repetitive regions, telomeres, etc?…
I used it to treat my WGS data instead of using the full chromossomes and it worked quite well (created a list of lists from it). :)
From javis on 2019-07-26
could I use the interval wgs_calling_regions.hg38.interval_list to filter bad regions with API SelectVariants?
`gatk SelectVariants -L wgs_calling_regions.hg38.interval_list`
wgs_calling_regions.hg38.interval_list download from https://console.cloud.google.com/storage/browser/_details/gatk-test-data/intervals/wgs_calling_regions.hg38.interval_list