Collected FAQs about interval lists

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019.

created by Geraldine_VdAuwera

on 2012-08-11

1. Can GATK tools be restricted to specific intervals instead of processing the entire reference?

Absolutely. Just use the -L argument to provide the list of intervals you wish to run on. Or you can use -XL to exclude intervals, e.g. to blacklist genome regions that are problematic.

2. What file formats does GATK support for interval lists?

GATK supports several types of interval list formats: Picard-style .interval_list, GATK-style .list, BED files with extension .bed, and VCF files.

A. Picard-style .interval_list

Picard-style interval files have a SAM-like header that includes a sequence dictionary. The intervals are given in the form <chr> <start> <stop> + <target_name>, with fields separated by tabs, and the coordinates are 1-based (first position in the genome is position 1, not position 0).

@HD	VN:1.0	SO:coordinate
@SQ	SN:1	LN:249250621	AS:GRCh37	UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta	M5:1b22b98cdeb4a9304cb5d48026a85128	SP:Homo Sapiens
@SQ	SN:2	LN:243199373	AS:GRCh37	UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta	M5:a0d9851da00400dec1098a9255ac712e	SP:Homo Sapiens
1	30366	30503	+	target_1
1	69089	70010	+	target_2
1	367657	368599	+	target_3
1	621094	622036	+	target_4
1	861320	861395	+	target_5
1	865533	865718	+	target_6

This is the preferred format because the explicit sequence dictionary safeguards against accidental misuse (e.g. applying hg18 intervals to an hg19 BAM file). Note that this file is 1-based, not 0-based (the first position in the genome is position 1).
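As a rough illustration of how the header and the interval records fit together, here is a minimal Python sketch that splits a Picard-style file into its header and interval lines, then checks each interval against the contig length declared in the sequence dictionary. The function names `parse_interval_list` and `contig_lengths` are made up for this example, and this is not how GATK or Picard implement the check.

```python
def parse_interval_list(text):
    """Split Picard-style interval_list text into (header_lines, intervals)."""
    header, intervals = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("@"):
            header.append(line)
            continue
        # Interval records: <chr> <start> <stop> <strand> <target_name>, tab-separated
        contig, start, stop, strand, name = line.split("\t")
        intervals.append((contig, int(start), int(stop), strand, name))
    return header, intervals


def contig_lengths(header):
    """Read contig name -> length from the @SQ lines (SN and LN tags)."""
    lengths = {}
    for line in header:
        fields = line.split("\t")
        if fields[0] != "@SQ":
            continue
        tags = dict(field.split(":", 1) for field in fields[1:])
        lengths[tags["SN"]] = int(tags["LN"])
    return lengths


# Shortened version of the example file above (only the SN, LN and AS tags kept).
example = "\n".join([
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:1\tLN:249250621\tAS:GRCh37",
    "1\t30366\t30503\t+\ttarget_1",
    "1\t69089\t70010\t+\ttarget_2",
])

header, intervals = parse_interval_list(example)
lengths = contig_lengths(header)

# 1-based, closed coordinates: every interval must fit inside its contig.
for contig, start, stop, _strand, _name in intervals:
    assert 1 <= start <= stop <= lengths[contig]
```

A check like this is what makes the embedded dictionary useful: intervals built against a different reference build are likely to name contigs or positions that the dictionary does not contain.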

B. GATK-style .list or .intervals

This is a simpler format, where intervals are in the form <chr>:<start>-<stop>, and no sequence dictionary is necessary. This file format also uses 1-based coordinates. Note that only the <chr> part is strictly required; if you just want to specify chromosomes/contigs as opposed to specific coordinate ranges, you don't need to specify the rest. Both <chr>:<start>-<stop> and <chr> can be present in the same file. You can also specify intervals in this format directly at the command line instead of writing them in a file.
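To make the accepted shapes concrete, here is a small Python sketch that parses <chr>:<start>-<stop> and bare <chr> strings. It also treats <chr>:<pos> as a single-position interval, which is an assumption based on user reports further down this thread. The function name `parse_interval` is hypothetical, and the regular expression assumes contig names do not contain a colon.

```python
import re

# Assumption: contig names contain no ':' character.
_INTERVAL = re.compile(r"^(?P<contig>[^:]+)(?::(?P<start>\d+)(?:-(?P<stop>\d+))?)?$")


def parse_interval(text):
    """Return (contig, start, stop); start and stop are None for a whole contig."""
    match = _INTERVAL.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid interval: {text!r}")
    contig = match.group("contig")
    if match.group("start") is None:
        return (contig, None, None)
    start = int(match.group("start"))
    # A bare position is treated as a one-base interval.
    stop = int(match.group("stop")) if match.group("stop") is not None else start
    if start < 1 or stop < start:
        raise ValueError(f"coordinates must be 1-based and increasing: {text!r}")
    return (contig, start, stop)


print(parse_interval("1:30366-30503"))
print(parse_interval("20"))
print(parse_interval("X:5000"))
```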

C. BED files with extension .bed

We also accept the widely-used BED format, where intervals are in the form <chr> <start> <stop>, with fields separated by tabs. However, you should be aware that this file format is 0-based for the start coordinates and the stop coordinate is exclusive, so when you derive a BED file from a 1-based format (e.g. if you're cooking up a custom interval list), subtract 1 from each start coordinate and leave the stop coordinate unchanged. The GATK engine recognizes the .bed extension and interprets the coordinate system accordingly.
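The coordinate shift is easy to get wrong, so here is a minimal Python sketch of the conversion in both directions. It assumes the standard BED convention (0-based start, exclusive end) and 1-based closed intervals on the other side; the helper names are made up for this example.

```python
def one_based_to_bed(contig, start, stop):
    """1-based closed [start, stop] -> BED 0-based half-open [start - 1, stop)."""
    return (contig, start - 1, stop)


def bed_to_one_based(contig, bed_start, bed_end):
    """BED 0-based half-open [bed_start, bed_end) -> 1-based closed interval."""
    return (contig, bed_start + 1, bed_end)


# target_1 from the Picard-style example above
picard = ("1", 30366, 30503)
bed = one_based_to_bed(*picard)
print(bed)

# Both representations describe the same number of bases.
assert picard[2] - picard[1] + 1 == bed[2] - bed[1]
assert bed_to_one_based(*bed) == picard
```

Only the start coordinate moves; the end coordinate has the same numeric value in both systems.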

D. VCF files

Yeah, I bet you didn't expect that was a thing! It's very convenient. Say you want to redo a variant calling run on a set of variant calls that you were given by a colleague, but with the latest version of HaplotypeCaller. You just provide the VCF, slap on some padding on the fly using e.g. -ip 100 in the HC command, and boom, done. Each record in the VCF will be interpreted as a single-base interval, and by adding padding you ensure that the caller sees enough context to reevaluate the call appropriately.
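As a sketch of what "each record becomes a single-base interval" means, the Python snippet below reads the CHROM and POS columns of a VCF and emits one-base intervals. The function name and the two example records are invented for illustration, and the snippet ignores everything else in the record; padding such as -ip 100 would then widen each interval on both sides.

```python
def vcf_sites_to_intervals(vcf_text):
    """Turn each VCF record into a 1-based single-base interval (contig, pos, pos)."""
    intervals = []
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information lines and the column header
        chrom, pos = line.split("\t")[:2]
        intervals.append((chrom, int(pos), int(pos)))
    return intervals


# Made-up records, for illustration only.
example_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t30400\t.\tA\tG\t50\tPASS\t.",
    "1\t69500\t.\tC\tT\t60\tPASS\t.",
])

sites = vcf_sites_to_intervals(example_vcf)
print(sites)
```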

3. Is there a required order of intervals?

Yes, thanks for asking. The intervals MUST be sorted by coordinate (in increasing order) within contigs; and the contigs must be sorted in the same order as in the sequence dictionary. This is for efficiency reasons.
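If you are generating an interval list yourself, the required order can be sketched in Python as follows: sort by the contig's rank in the sequence dictionary first, then by start and stop. The contig order and the function names here are assumptions for the example; in practice the order comes from your reference's dictionary.

```python
def sort_intervals(intervals, contig_order):
    """Sort (contig, start, stop) tuples by dictionary order, then by coordinate."""
    rank = {contig: index for index, contig in enumerate(contig_order)}
    return sorted(intervals, key=lambda iv: (rank[iv[0]], iv[1], iv[2]))


def is_sorted(intervals, contig_order):
    """True if the intervals are already in the required order."""
    return list(intervals) == sort_intervals(intervals, contig_order)


# Hypothetical dictionary order; note that "10" comes after "2" here,
# whereas a plain alphabetical sort would put "10" first.
contig_order = ["1", "2", "10", "X"]
unsorted = [("10", 500, 600), ("2", 900, 950), ("1", 200, 300), ("2", 100, 150)]

ordered = sort_intervals(unsorted, contig_order)
print(ordered)
```

Sorting contig names alphabetically is a common source of ordering errors, which is why the rank lookup is used instead of comparing the names directly.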

4. Can I provide multiple sets of intervals?

Sure, no problem -- just pass them in using separate -L arguments. You can use all the different formats within the same command line. By default, the GATK engine will take the UNION of all the intervals in all the sets. This behavior can be modified by setting an interval_set rule.
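The difference between taking the union and taking the intersection of two interval sets can be shown with a toy Python example. This sketch expands each interval into individual positions, which is only sensible for tiny inputs and is not how the GATK engine works; the function name and the coordinates are made up.

```python
def covered(intervals):
    """Expand 1-based closed intervals into a set of (contig, position) pairs."""
    return {(contig, pos)
            for contig, start, stop in intervals
            for pos in range(start, stop + 1)}


set_a = [("1", 100, 200), ("1", 500, 600)]
set_b = [("1", 150, 550)]

union = covered(set_a) | covered(set_b)         # positions in either set
intersection = covered(set_a) & covered(set_b)  # positions in both sets

print(len(union), len(intersection))
```

With the union, the run covers everything from 100 to 600; with the intersection, only 150-200 and 500-550 remain.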

5. How will GATK handle intervals that abut or overlap?

Very gracefully. By default the GATK engine will merge any intervals that abut (i.e. they are contiguous, they touch without overlapping) or overlap into a single interval. This behavior can be modified by setting an interval_merging rule.
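The merging behavior described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the engine's code: the input must already be sorted, and the `merge_abutting` flag is a hypothetical stand-in for the choice between merging everything that touches and merging only true overlaps.

```python
def merge_intervals(intervals, merge_abutting=True):
    """Merge sorted 1-based closed intervals that overlap (and optionally abut)."""
    # Abutting intervals satisfy start == previous_stop + 1.
    gap = 1 if merge_abutting else 0
    merged = []
    for contig, start, stop in intervals:
        if merged and merged[-1][0] == contig and start <= merged[-1][2] + gap:
            previous = merged[-1]
            merged[-1] = (contig, previous[1], max(previous[2], stop))
        else:
            merged.append((contig, start, stop))
    return merged


# Made-up intervals: the first two abut, the second and third overlap.
intervals = [("1", 100, 200), ("1", 201, 300), ("1", 250, 400), ("1", 500, 600)]

print(merge_intervals(intervals))
print(merge_intervals(intervals, merge_abutting=False))
```

With abutting intervals merged, the first three collapse into 1:100-400; with overlaps only, 1:100-200 stays separate from 1:201-400.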

6. What's the best way to pad intervals?

You can use the -ip engine argument to add padding on the fly. No need to produce separate padded targets files. Sweet, right?

Note that if padding causes intervals that previously didn't abut or overlap to do so, by default the GATK engine will merge them as described above. This behavior can be modified by setting an interval_merging rule.
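To see how padding can trigger merging, here is a minimal Python sketch that pads sorted intervals and then merges any that now abut or overlap. The function name is hypothetical; the sketch clamps the padded start at position 1 but, for brevity, does not clamp the padded stop at the contig length.

```python
def pad_and_merge(intervals, padding):
    """Pad sorted 1-based closed intervals, then merge those that abut or overlap."""
    merged = []
    for contig, start, stop in intervals:
        start, stop = max(1, start - padding), stop + padding
        if merged and merged[-1][0] == contig and start <= merged[-1][2] + 1:
            previous = merged[-1]
            merged[-1] = (contig, previous[1], max(previous[2], stop))
        else:
            merged.append((contig, start, stop))
    return merged


# target_5 and target_6 from the Picard-style example above
targets = [("1", 861320, 861395), ("1", 865533, 865718)]

print(pad_and_merge(targets, 100))   # still two separate intervals
print(pad_and_merge(targets, 2500))  # padding closes the gap, so they merge
```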

Updated on 2016-10-14

From prepagam on 2014-04-22

If I were only interested in calling variants in a set of neutral regions, would there be any negative implications to intersecting my BAM with a BED file of these regions PRIOR to running GATK, i.e. doing this rather than using the genomic intervals option that GATK offers? For me this is preferable for various storage reasons, but perhaps it has some unknown side effect with GATK.

From Geraldine_VdAuwera on 2014-04-23

No problem at all, you can use whatever intervals you want. This may influence the expected Ti/Tv ratio, so keep that in mind when you analyze your callset, but it shouldn’t have any effect on the quality of results.

From eflannery on 2015-06-09

Hi Geraldine, It seems like there is a minimum size an interval in the interval list needs to be in order to be output by the DiagnoseTargets walker. Do you know this minimum? Is it a default or calculated each time? Is there a way to change it?

Thanks!

Erika

From Geraldine_VdAuwera on 2015-06-09

Hi @eflannery,

I just looked at the code and didn’t find any hardcoded limits. The only limitation that I’m aware of is that intervals must be non-null (i.e. not zero-length). Why do you think there’s a limit?

From eflannery on 2015-06-09

When I run DiagnoseTargets, there are intervals that are present in the interval_list file but not in the output file. All of the excluded intervals are very small, <500bp, so I assumed this is why they were not included. Shouldn’t every interval in the interval_list be included in the output of DiagnoseTargets?

Thanks!

Erika

From Sheila on 2015-09-01

@eflannery

Hi Erika,

Sorry for the late response. I was going through my old emails and found this! Are you still having an issue with this? Is it possible that the short intervals overlap some other longer intervals and are getting output as part of the longer intervals?

Thanks,

Sheila

From Katie on 2016-01-20

Is there a way to define an interval list by position rather than interval? For example, if I am interested in using SelectVariants, can I query a VCF with a list containing only contig and SNP position? I’ve tried this but it seems like I need to define regions rather than positions.

Thank you!

From Katie on 2016-01-20

Sorry to bother, I found that vcftools will filter with a tab-delimited list of chromosome and position with the command:

vcftools --vcf 'VCFfile' --positions 'positions_list'

Cheers,

From Geraldine_VdAuwera on 2016-01-21

You can do this with SelectVariants, sure. You can pass in single positions using either the interval list format or a vcf of sites of interest.

From QazSeDc on 2016-10-07

I’ve had a hard time running DepthOfCoverage with the correct format of interval file.

I tried following the GATK instructions but it still wouldn’t work.

Would anyone please give an example for each of the .list .intervals and .interval_list format?

From Geraldine_VdAuwera on 2016-10-07

Please see https://software.broadinstitute.org/gatk/guide/article?id=1204

From QazSeDc on 2016-10-11

Hi @Geraldine_VdAuwera ,

I have tried the [chr] [start] [stop] format with .list .intervals and .interval_list filename extension mentioned in https://software.broadinstitute.org/gatk/guide/article?id=1204 but it wouldn’t work.

I figured the [chr] [start] [stop] format only worked for .bed files and the only time when .list .intervals and .interval_list worked out was to use the [chr]:[start]-[stop] format.

Am I missing something?

From Geraldine_VdAuwera on 2016-10-14

Hi @QazSeDc, I rewrote this article to be more clear about what is supported, what are the requirements and also some of the convenience options that are related to intervals. I hope this helps.

From QazSeDc on 2016-10-25

Thank you @Geraldine_VdAuwera!

This new guideline explains everything clearly!

From biojiangke on 2018-06-22

Hi,

I have a question about the behavior of the interval option in CombineGVCFs: I understand it can take the standard samtools/GATK format chr:start-end, and BED format, but it can also take the format chr:pos, as I tried. I would expect GATK to process one genomic position in this situation, but instead, I’m getting results up to 5bp from the specified position. Could anyone provide more information about this behavior?

The application behind this is that sometimes we use this type of operation to fetch genotypes across samples with WGS data and compare with results from other genotyping platforms such as SNP chips and amplicons. In this case, the sites to be checked are discrete and scattered across the genome and I had to supply GATK with multiple intervals.

P. S. also posted this in a different thread before finding this one.

Thanks!

Ke

From Sheila on 2018-06-25

@biojiangke

Hi Ke,

I will answer [there](https://gatkforums.broadinstitute.org/gatk/discussion/comment/49851#Comment_49851).

-Sheila

From Jason_Wu on 2018-12-01

Dear GATK team,

I ran into an error when using a BED file as the INTERVAL input for “gatk GenotypeConcordance” (Picard).

I ended up using “gatk BedToIntervalList” to produce a new interval file to use as the INTERVAL input instead.

But this article says that GATK also accepts BED files as interval input.

So I was wondering whether that statement only covers the “original” GATK tools, and not the Picard tools that are now also included in the GATK tool list.

Thanks.

Jason

From shlee on 2018-12-01

Hi @Jason_Wu,

Picard definitely accepts Picard-style interval lists, and GATK accepts both Picard-style interval lists and BED files. Given what you report, I will put in a request for Picard tools called through GATK to also accept BED format. Thanks for bringing this to our attention.

P.S. Here is the GitHub issue ticket I placed for you: https://github.com/broadinstitute/gatk/issues/5472