created by delangel
on 2012-07-23
These errors occur when the names or sizes of contigs don't match between input files. This is a classic problem that typically happens when you get some files from collaborators, you try to use them with your own data, and GATK fails with a big fat error saying that the contigs don't match.
The first thing you need to do is find out which files are mismatched, because that will affect how you can fix the problem. This information is included in the error message, as shown in the examples below. You'll notice that GATK always evaluates everything relative to the reference.
A very common case we see looks like this:
##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Found contigs with the same name but different lengths:
##### ERROR contig reads = chrM / 16569
##### ERROR contig reference = chrM / 16571.
##### ERROR reads contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]
##### ERROR reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]
First, the error tells us that the mismatch is between the file containing reads, i.e. our BAM file, and the reference:
Input files reads and reference have incompatible contigs
It further tells us that the contig length doesn't match for the chrM contig:
Found contigs with the same name but different lengths:
##### ERROR contig reads = chrM / 16569
##### ERROR contig reference = chrM / 16571.
This can be caused either by using the wrong genome build version entirely, or using a reference that was hacked from a build that's very close but not identical, like b37 vs hg19, as detailed a bit more below.
We sometimes also see cases where people are using a very different reference; this is especially the case for non-model organisms where there is not yet a widely-accepted standard genome reference build.
Note that the error message also lists the contents of the sequence dictionaries that it found for each file. Some contigs in our reference dictionary are not listed in the BAM dictionary, but that's not a problem. If it were the opposite, with extra contigs in the BAM (or VCF), then GATK wouldn't know what to do with the reads from these extra contigs and would error out (even if we tried restricting the analysis using -L) with something like this:
#### ERROR MESSAGE: BAM file(s) do not have the contig: chrM. You are probably using a different reference than the one this file was aligned with.
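To see the same comparison outside of GATK, here is a minimal Python sketch that checks an input's sequence dictionary against the reference's. It assumes you already have the header text of each file (for example from `samtools view -H` for a BAM, or the contents of the reference .dict file); the function names and the sample header strings are made up for illustration, and the check only covers contig names and lengths, not contig order.

```python
def parse_sq_lines(header_text):
    """Return an ordered list of (name, length) pairs from @SQ header lines."""
    contigs = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each @SQ line is tab-separated TAG:VALUE fields, e.g. SN:chrM and LN:16569.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        contigs.append((fields["SN"], int(fields["LN"])))
    return contigs


def compare_dictionaries(input_contigs, reference_contigs):
    """List input contigs that are absent from the reference or differ in length.

    Contigs that exist only in the reference are not reported, since the
    article notes that those are not a problem.
    """
    ref_lengths = dict(reference_contigs)
    problems = []
    for name, length in input_contigs:
        if name not in ref_lengths:
            problems.append(f"{name}: not in reference")
        elif ref_lengths[name] != length:
            problems.append(
                f"{name}: input length {length}, reference length {ref_lengths[name]}"
            )
    return problems


# Illustrative headers only; real ones come from your own BAM and .dict files.
bam_header = "@HD\tVN:1.5\n@SQ\tSN:chr1\tLN:249250621\n@SQ\tSN:chrM\tLN:16569\n"
ref_dict = (
    "@HD\tVN:1.5\n@SQ\tSN:chrM\tLN:16571\n"
    "@SQ\tSN:chr1\tLN:249250621\n@SQ\tSN:chr2\tLN:243199373\n"
)

problems = compare_dictionaries(parse_sq_lines(bam_header), parse_sq_lines(ref_dict))
for problem in problems:
    print(problem)
```

Running a check like this on every input before launching a long job can tell you which file is the odd one out without waiting for GATK to fail.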
Solution
If you can, simply switch to the correct reference. Note that file names may be misleading, as people will sometimes rename files willy-nilly. Sometimes you'll need to do some detective work to identify the correct reference if you inherited someone else's sequence data.
If that's not an option because you either can't find the correct reference or you absolutely MUST use a particular reference build, then you will need to redo the alignment altogether. Sadly there is no liftover procedure for reads. If you don't have access to the original unaligned sequence files, you can use Picard tools to revert your BAM file back to an unaligned state (either unaligned BAM or FASTQ depending on the workflow you wish to follow).
Special case of b37 vs. hg19
The b37 and hg19 human genome builds are very similar, and the canonical chromosomes (1 through 22, X and Y) only differ by their names (no prefix vs. chr prefix, respectively). If you only care about those, and don't give a flying fig about the decoys or the mitochondrial genome, you could just rename the contigs throughout your mismatching file and call it done, right?
Well... This can work if you do it carefully and cleanly -- but many things can go wrong during the editing process that can screw up your files even more, and it only applies to the canonical chromosomes. The mitochondrial contig is a slightly different length (see error above) in addition to having a different naming convention, and all the other contigs (decoys, herpes virus etc) don't have direct equivalents.
So only try that if you know what you're doing. YMMV.
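For those who do want to try it, here is a minimal Python sketch of the renaming idea for a b37-style VCF, restricted to the canonical chromosomes as described above. It is an assumption-laden illustration, not a supported procedure: the function name is hypothetical, it drops every record and `##contig` header line that is not on chromosomes 1 through 22, X or Y (so the mitochondrial contig and decoys are discarded), and it does not touch contig names that may appear inside ALT or INFO fields.

```python
import re

# Canonical chromosomes only: b37 names them "1".."22", "X", "Y";
# hg19 uses the same names with a "chr" prefix.
CANONICAL = {str(n) for n in range(1, 23)} | {"X", "Y"}

CONTIG_HEADER = re.compile(r"^##contig=<ID=([^,>]+)")


def rename_b37_line_to_hg19(line):
    """Return the VCF line with its contig renamed, or None if it should be dropped."""
    match = CONTIG_HEADER.match(line)
    if match:
        name = match.group(1)
        if name not in CANONICAL:
            return None  # no direct hg19 equivalent (MT, decoys, etc.)
        return line.replace(f"ID={name}", f"ID=chr{name}", 1)
    if line.startswith("#"):
        return line  # other header lines pass through unchanged
    chrom, sep, rest = line.partition("\t")
    if chrom not in CANONICAL:
        return None
    return f"chr{chrom}{sep}{rest}"


def rename_b37_vcf_to_hg19(lines):
    """Apply the renaming to an iterable of VCF lines, skipping dropped ones."""
    for line in lines:
        renamed = rename_b37_line_to_hg19(line)
        if renamed is not None:
            yield renamed
```

After any edit of this kind, delete the old index and let the tools regenerate it, then compare the result against the reference dictionary before trusting it.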
ERROR MESSAGE: Input files known and reference have incompatible contigs: Found contigs with the same name but different lengths:
ERROR contig known = chrM / 16569
ERROR contig reference = chrM / 16571.
Yep, it's just like the error we had with the BAM file above. Looks like we're using the wrong genome build again and a contig length doesn't match. But this time the error tells us that the mismatch is between the file identified as known and the reference:
Input files known and reference have incompatible contigs
We know (trust me) that this is the output of a RealignerTargetCreator command, so the known file must be the VCF file provided through the known argument. Depending on the tool, the way the file is identified may vary, but the logic should be fairly obvious.
Solution
If you can, find a version of the VCF file that is derived from the right reference. If you're working with human data and the VCF in question is just a common resource like dbsnp, you're in luck -- we provide versions of dbsnp and similar resources derived from the major human reference builds in our resource bundle (see FAQs for access details).
location: ftp.broadinstitute.org username: gsapubftp-anonymous
If that's not an option, then you'll have to "liftover" -- specifically, lift over the mismatching VCF to the reference you need to work with. The best tool for liftover is Picard's LiftoverVcf.
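As a rough illustration, the Python sketch below assembles a Picard LiftoverVcf command line. The jar path and file names are placeholders, and the helper function is hypothetical; the `I`, `O`, `CHAIN`, `REJECT` and `R` arguments follow Picard's classic KEY=VALUE syntax, so check them against the documentation for the Picard version you have installed. The target reference is assumed to have a sequence dictionary (.dict) next to it.

```python
import shlex


def liftover_vcf_command(picard_jar, input_vcf, chain, new_reference, output_vcf, reject_vcf):
    """Build the argument list for a Picard LiftoverVcf run (nothing is executed here)."""
    return [
        "java", "-jar", picard_jar, "LiftoverVcf",
        f"I={input_vcf}",        # VCF on the old reference
        f"O={output_vcf}",       # lifted-over VCF on the new reference
        f"CHAIN={chain}",        # chain file mapping old coordinates to new
        f"REJECT={reject_vcf}",  # records that could not be lifted over
        f"R={new_reference}",    # FASTA of the reference you are lifting to
    ]


# Placeholder paths for illustration only.
cmd = liftover_vcf_command(
    picard_jar="picard.jar",
    input_vcf="calls.b36.vcf",
    chain="b36ToHg19.broad.over.chain",
    new_reference="Homo_sapiens_assembly19.fasta",
    output_vcf="calls.hg19.vcf",
    reject_vcf="calls.rejected.vcf",
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files. Inspect the reject file afterwards, since records that fail to lift over end up there rather than in the output.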
GATK used to include some liftover utilities (documented below for the record) but we no longer support them.
Liftover procedure with older versions of GATK
This procedure involves three steps:
We provide a script that performs those three steps for you, called liftOverVCF.pl, which is available in our public source repository -- but you have to check out a version older than 3.4 -- under the 'perl' directory. Instructions for pulling down our source code from github are available here.
The example below shows how you would run the script:
./liftOverVCF.pl \
    -vcf calls.b36.vcf \
    -chain b36ToHg19.broad.over.chain \
    -out calls.hg19.vcf \
    -gatk gatk_source \
    -newRef Homo_sapiens_assembly19 \
    -oldRef human_b36_both \
    -tmp /broad/shptmp

where the arguments are:
-vcf: input vcf
-chain: chain file
-out: output vcf
-gatk: path to source code
-newRef: path to new reference base name (without extension)
-oldRef: path to old reference prefix (without extension)
-tmp: temp file location (defaults to /tmp)
We provide several chain files to liftover between the major human reference builds, also in our resource bundle (mentioned above) in the Liftover_Chain_Files directory. If you are working with non-human organisms, we can't help you -- but others may have chain files, so ask around in your field.
Note that if you're at the Broad, you can access chain files to liftover from b36/hg18 to hg19 on the humgen server:
/humgen/gsa-hpprojects/GATK/data/Liftover_Chain_Files/
Updated on 2016-10-25
From adr1an on 2016-04-19
So I got this error,
ERROR contig reads is named chrM with length 16569
ERROR contig reference is named chrM with length 16571 and MD5 d2ed829b8a1628d16cbeee88e88e39eb.
But I'm quite sure that I'm using the correct reference (hg19). I have inherited someone else's sequence data, and it's supposed to be aligned against hg19. Right now I don't have the computational power to realign the reads, so I need to do the detective work mentioned in the tutorial. Where do I start? I tried hg38 but not hg18. Will do that.
From Will_Gilks on 2016-04-19
@adr1an I’ve found the name label for the mitochondrial genome can vary between assemblies, e.g. chrM vs mitochondrial_genome. Reference metadata name labels can also vary within an assembly version, e.g. between fasta and chain file types. Also, it’s quite possible that some databases start the genome at 0bp, and some at 1bp.
From Sheila on 2016-04-20
@adr1an
Hi,
It is best to find out the exact reference the original data was aligned to by asking your collaborators. They may have used a manipulated version of hg19.
-Sheila
From IgnacioSeret on 2016-05-12
I’m having this same problem but both the vcf and the reference fasta are from 1000G. I’m trying to use Fasta Alternate Reference Maker to get some variants from 1000G.
ERROR MESSAGE: Input files variant and sequence have incompatible contigs.
From Sheila on 2016-05-13
@IgnacioSeret
Hi,
Can you please post the Fasta dict file and VCF header?
Thanks,
Sheila
From IgnacioSeret on 2016-05-16
@Sheila
Here is the vcf [http://pastebin.com/mQEgPWVk](http://pastebin.com/mQEgPWVk)
And the fasta dict [http://pastebin.com/DfLXA1u1](http://pastebin.com/DfLXA1u1)
From Sheila on 2016-05-23
@IgnacioSeret
Hi,
The issue is that the VCF file has all the reference contigs, but the fasta dict file only has one contig. You should use the same reference that you created the VCF with.
-Sheila
From lalithav on 2016-10-22
Can you point me to the location of the liftOverVCF.pl script? It is not available under the location mentioned in the article. There is no “public” repository available on git.
From Geraldine_VdAuwera on 2016-10-25
@lalithav The script has been deprecated in favor of the Picard lift over tool as described in the text. If you need a copy you’ll have to check out an older version of the code from github. We don’t provide guidance on that.
From era on 2017-04-26
I had the same problem at the early step of using GATK (SplitNCigarReads)
ERROR contig reads is named chr1B_part1 with length 444204156
ERROR contig reference is named chr1B_part1 with length 438720154 and MD5 8b598b671e24750d69df9acd99d7645c.
However, I used the same reference genome to align my reads using STAR and to split them using SplitNCigarReads. I did not manipulate the reference.
These are some steps I followed to process my Illumina RNA-seq data: I removed Ns from reads (trim_galore --trim-n), aligned reads (STAR), sorted them by coordinates (samtools sort), marked duplicates and added read group info to the BAM files (picard and AddOrReplaceReadGroups), and ran SplitNCigarReads.
I am working on a hexaploid species which has three copies of each 2n chromosome. For example, chr1 in a diploid species has three copies, chr1A, chr1B and chr1D, in my species. In the reference genome, each chromosome is split into two parts. So for example, for chr1A, there are two parts, chr1A_part1 and chr1A_part2.
Any help, please.
From Sheila on 2017-04-28
@era
Hi,
Can you please tell us which version of GATK you are using and the exact command you ran? Please also post your BAM header and FASTA .dict file.
Thanks,
Sheila
From oskarv on 2017-05-03
My contigs are missing from the VCF files from haplotype caller, I’m using GATK4 for everything up until haplotype caller, where I use gatk4-protected.jar since the other GATK4 doesn’t have haplotype caller in it. Could that be the problem? And I’m using the GRCh38 bundle from your ftp server. I’m using a pair of NA12878 fastq files as input, could that have any impact on the issue?
I attached the code, and just in case I’ll show the entire pipeline until haplotype caller. You’ll notice that I commented out the create index flag because it crashes since the contigs are missing. As a temporary fix I made a script that pastes the contigs into each VCF file, it’s hacky but it lets me finish the pipeline at least.
Edit: In case it matters, I run haplotype caller with scatter gather, I used the wgs_calling_regions.hg38.interval_list from your ftp, and created 49 intervals lists with your python script that creates interval lists.
From Geraldine_VdAuwera on 2017-05-15
@oskarv Can you clarify what you mean by “My contigs are missing from the VCF files from haplotype caller”? And does this still happen if you just call HaplotypeCaller manually rather than going through the pipeline? Note that if you’re using a precompiled gatk4-protected.jar it may have been generated when the HaplotypeCaller port was incomplete. You can try compiling the latest version directly from source and see if the problem still occurs.
From oskarv on 2017-05-18
> @Geraldine_VdAuwera said:
> @oskarv Can you clarify what you mean by “My contigs are missing from the VCF files from haplotype caller”? And does this still happen if you just call HaplotypeCaller manually rather than going through the pipeline? Note that if you’re using a precompiled gatk4-protected.jar it may have been generated when the HaplotypeCaller port was incomplete. You can try compiling the latest version directly from source and see if the problem still occurs.
When I used GATK 3.7 the contigs in the header came back, this is a good reminder to not assume alpha releases should be trusted.
From ehscholl on 2017-11-03
I'm having a heck of a time with VariantRecalibrator (and also with HaplotypeCaller) and I can NOT figure it out!
(Using v3.6-0-g89b7209 - going to try a newer version next to see if that helps!)
ERROR MESSAGE: Input files dbsnp and reference have incompatible contigs. Please see https://www.broadinstitute.org/gatk/guide/article?id=63 for more information. Error details: The contig order in dbsnp and reference is not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
ERROR dbsnp contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 3, 30, 31, 32, 33, 34, 35, 36, 37, 38, 4, 5, 6, 7, 8, 9, MT, X]
ERROR reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, X, MT]
The dbsnp contigs are NOT in the order it says. I've run picard tools just to be sure it's using my dict file to reorder the vcf file to be the same:
java -jar ~/software/picard-tools-2.1.1/picard.jar SortVCF I=Canisfamiliaris.vcf O=Canisfamiliaris.sorted.vcf SEQUENCE_DICTIONARY=canFam3.dict
.dict file (with the last column removed for privacy's sake..):
@HD VN:1.5 SO:unsorted
@SQ SN:1 LN:122678785 M5:e4671b339daa96b7f11eb0b68fd999d8
@SQ SN:2 LN:85426708 M5:526c549b204117f61cd292042a7127d2
@SQ SN:3 LN:91889043 M5:8eb8096e77c3393d3a733f6e75947ef7
@SQ SN:4 LN:88276631 M5:9c355cda76edca97ddb55861f9e4ddb3
@SQ SN:5 LN:88915250 M5:aff52e36c70d6d3393b23e38493237e2
@SQ SN:6 LN:77573801 M5:1b3141ef2c2e46de690fc3c1405633fd
@SQ SN:7 LN:80974532 M5:838450ebc0401f9940ed5ab083b1abb3
@SQ SN:8 LN:74330416 M5:4b1687de592b43bf84a25bc949f21e73
@SQ SN:9 LN:61074082 M5:ef864fc0d3e9f1459bae310181b8181b
@SQ SN:10 LN:69331447 M5:17d0aca68c27ef37c9f0ebbd12de292b
@SQ SN:11 LN:74389097 M5:1e96358dda2532b5f3708aac2735ba3a
@SQ SN:12 LN:72498081 M5:1af4e4d874509655b3e212cf50f7331f
@SQ SN:13 LN:63241923 M5:6f65dcdd752ffc98151fde4dcf9ee01e
@SQ SN:14 LN:60966679 M5:6d3967996813213e73699a3b8211d138
@SQ SN:15 LN:64190966 M5:371a438c88e9f147f5d4c61bdcd3d792
@SQ SN:16 LN:59632846 M5:ed482bf25a0bcc5fbf659f06d6b41c50
@SQ SN:17 LN:64289059 M5:0826b43cdce3178d6404215550eb3d40
@SQ SN:18 LN:55844845 M5:6e48893b7bfd278ab1df5af50d2484d3
@SQ SN:19 LN:53741614 M5:3707380393157aea14e5b48ff83d5b4f
@SQ SN:20 LN:58134056 M5:5c95e877999b589b214b47803f9f2d14
@SQ SN:21 LN:50858623 M5:083c99713ecd10a5d868f869bc160455
@SQ SN:22 LN:61439934 M5:7c01e63f507d5fe5e6e5d0ead462a309
@SQ SN:23 LN:52294480 M5:ead162a374e1db778dd784aa75a3cf6e
@SQ SN:24 LN:47698779 M5:4768cf400d4fe63d0f81b0c947151043
@SQ SN:25 LN:51628933 M5:9388c88437e1cef90a4b2b620e311c38
@SQ SN:26 LN:38964690 M5:be48db1d8ebcb9f3a98d935de90f7b22
@SQ SN:27 LN:45876710 M5:4a4d0acf64d0b2ed29ea81cc381d5a2a
@SQ SN:28 LN:41182112 M5:1a5bb012b79fb869f8cf9bd30b8e8d8e
@SQ SN:29 LN:41845238 M5:43413b6a8749116ebf76690c44137580
@SQ SN:30 LN:40214260 M5:2f64ad1c38580ca188e9d20c577de12a
@SQ SN:31 LN:39895921 M5:5b87005a45ed5aecb8002d76c0791ad3
@SQ SN:32 LN:38810281 M5:d355d4fa0308504dc63693f862407cf8
@SQ SN:33 LN:31377067 M5:7d008ea7eb73eeeb44f46947ce51390c
@SQ SN:34 LN:42124431 M5:20ed182075d10cc2fc79cc4bd1b558ab
@SQ SN:35 LN:26524999 M5:75616cfbafca711d9f19c846fb0ce4d8
@SQ SN:36 LN:30810995 M5:abda6d4222b517c9cd3893318915fc2b
@SQ SN:37 LN:30902991 M5:e88ba4208ced145b8c78c908b2cfaa38
@SQ SN:38 LN:23914537 M5:8fe5f9dd3898c367a54caaf88a5f2441
@SQ SN:X LN:123869142 M5:687344e1c4b056567566581427843b05
@SQ SN:MT LN:16727 M5:f7d0daed161cd98504d46f25ab53a038
dbsnp vcf file header without the hashes in front to save it from thinking it's markup (and the file itself is in order):
contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig= contig=
What am I missing? Because I can't see it. :(
From ehscholl on 2017-11-04
I’ve now tried v3.7 and v3.8. Now it’s telling me my input file and the reference are not in the same order, but again, they are.
I’ve even sorted the input file (the haplotype caller output) using the dict file to be sure.
From Sheila on 2017-11-06
@ehscholl
Hi,
Can you try deleting the VCF index and letting GATK re-generate one for you? I think this may be related to a SortVcf bug that is now fixed in the latest version (if you want to upgrade to the latest Picard).
-Sheila
From rkendar on 2018-03-29
Hi,
I am doing variant calling on RNA-seq data, following your RNA-seq best practices. I am now running RealignerTargetCreator (before indel realignment). I ran HISAT2 for the alignment using a HISAT2 built-in reference, which uses Ensembl gene annotation (https://ccb.jhu.edu/software/hisat2/indexes.txt). Thus, my BAMs use the Ensembl naming format (without “chr”).
For the VCF, I followed the GATK recommendation to use Mills_and_1000G_gold_standard.indels.hg38.vcf. But after getting this error, I realised that the VCF uses a different naming format (with “chr”) from my reference and my BAM.
_ERROR MESSAGE: Input files Mills_and_1000G_gold_standard.indels.hg38.vcf and reference have incompatible contig.
ERROR Mills_and_1000G_gold_standard.indels.hg38.vcf contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17,.., chr1_KI270706v1_random, chr4_GL000008v2_random, chr14_GL000009v2_random,…]
ERROR reference contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17,…, KI270728.1, KI270727.1, KI270442.1, KI270729.1, GL000225.1,….]_
Is there any way to change the VCF chromosome ID to Ensembl format? Or is there any better suggestion? Thank you.
From Sheila on 2018-04-01
@rkendar
Hi,
The best thing to do is use the files we provide in our [resource bundle](https://software.broadinstitute.org/gatk/download/bundle). Otherwise, it looks like your reference is b37 rather than hg38. You may be able to use the files we provide in the resource bundle for b37.
Also, you do not need to do Indel Realignment step if you plan to use HaplotypeCaller or Mutect2. Have a look at [this blog post](https://software.broadinstitute.org/gatk/blog?id=7847) for more information.
-Sheila
From rkendar on 2018-04-03
Hi @Sheila ,
> @Sheila said:
> Otherwise, it looks like your reference is b37 rather than hg38. You may be able to use the files we provide in the resource bundle for b37.
>
Thank you for your reply.
I use b38 as my reference (not b37), which I downloaded from Ensembl (Homo_sapiens.GRCh38.dna.primary_assembly.fa).
Actually, I downloaded the VCF file from your resource bundle. I have also checked your reference genome, and it looks like all the files in your resource bundle use the hg38 format (with “chr”):
_chr1
chr10
chr11
chr12
chr13
chr14
chr14_GL000009v2_random
chr14_KI270726v1_random
chr14_KI270846v1_alt
chr15
chr15_KI270850v1_alt
chr16
chr17
chr17_KI270857v1_alt
chr17_KI270862v1_alt
chr17_KI270909v1_alt
chr18
chr19
chr19_KI270938v1_alt
…_
Since I have 300 BAMs aligned to b38, it would be time consuming to redo the alignment using hg38. Thus, I think it would be more efficient to change the VCF file rather than the BAMs. Or do you have the VCF in b38 format?
> @Sheila said:
> Also, you do not need to do Indel Realignment step if you plan to use HaplotypeCaller or Mutect2. Have a look at [this blog post](https://software.broadinstitute.org/gatk/blog?id=7847) for more information.
>
Thank you for pointing out :)
How about the Base Recalibration step? It’s recommended in the best practices, but I am afraid I will get the same error since this step also requires the same VCF file.
From Sheila on 2018-04-04
@rkendar
Hi,
We do not provide b38 format. If you are sure b38 is hg38 with chr removed, it is fine to remove the chr from our files. I could not find confirmation of this with a quick google search.
You should indeed still run the Base Recalibration step, as it is like “fire insurance” for your dataset (as Geraldine says).
-Sheila
From rkendar on 2018-04-04
Hi @Sheila, alright, thank you for your suggestion :)