Created by jonn on 2019-01-29
This page explains the discrepancies between the different "hg19" references.
There are 4 common "hg19" references, and they are NOT directly interchangeable:
* hg19 (ucsc.hg19.fasta, MD5sum: a244d8a32473650b25c6e8e1654387d6)
* b37 (Homo_sapiens_assembly19.fasta, MD5sum: 886ba1559393f75872c1cf459eb57f2d)
* GRCh37 (GRCh37.p13.genome.fasta, MD5sum: c140882eb2ea89bc2edfe934d51b66cc)
* humanG1Kv37 (human_g1k_v37.fasta, MD5sum: 0ce84c872fc0072a885926823dcd0338)
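Because these files are often redistributed under the same "hg19" label, one way to tell which reference a local copy is would be to compute its MD5sum and compare it with the values listed above. The following is a minimal sketch, assuming the file is an uncompressed FASTA on local disk; the function names `file_md5` and `identify_reference` are illustrative and not part of any GATK tool.

```python
import hashlib

# MD5sums quoted on this page, keyed by the file name they were reported for.
EXPECTED_MD5 = {
    "ucsc.hg19.fasta": "a244d8a32473650b25c6e8e1654387d6",
    "Homo_sapiens_assembly19.fasta": "886ba1559393f75872c1cf459eb57f2d",
    "GRCh37.p13.genome.fasta": "c140882eb2ea89bc2edfe934d51b66cc",
    "human_g1k_v37.fasta": "0ce84c872fc0072a885926823dcd0338",
}


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def identify_reference(path):
    """Return the listed file name whose MD5sum matches the file, or None."""
    observed = file_md5(path)
    for name, expected in EXPECTED_MD5.items():
        if expected == observed:
            return name
    return None
```

A return value of `None` only means the file is not byte-identical to one of the four files listed here. A re-wrapped or re-sorted copy of the same sequences would also fail to match.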
The Genome Reference Consortium Human Build 37, GRCh37 (GRCh37.p13.genome.fasta, MD5sum: c140882eb2ea89bc2edfe934d51b66cc), is a Homo sapiens genome reference built by the Genome Reference Consortium. It is the baseline human genome reference and serves as the basis for the other three references in this comparison.
For more information on GRCh37, visit the official Genome Reference Consortium website.
The following are links to the GRCh37 reference: * FASTA
The University of California at Santa Cruz (UCSC) has created a reference based on GRCh37. This reference is often referred to as hg19 (ucsc.hg19.fasta, MD5sum: a244d8a32473650b25c6e8e1654387d6).
This reference contains some alterations from the baseline reference from the Genome Reference Consortium. These alterations consist largely of contig name changes; however, there are known sequence differences on some contigs as well.
For details see the comparison table.
The following are links to the hg19 reference: * FASTA
The Broad Institute created a human genome reference file based on GRCh37. This reference is often referred to as b37 (Homo_sapiens_assembly19.fasta, MD5sum: 886ba1559393f75872c1cf459eb57f2d).
When people at The Broad Institute's Genomics Platform refer to the hg19 reference, they are actually referring to b37.
This reference contains some alterations from the baseline reference from the Genome Reference Consortium. These alterations consist largely of contig name changes; however, there are known sequence differences on some contigs as well.
Anecdotally, the changes affect bases for which there was low confidence. Those low-confidence bases were masked in the b37 reference with N, the IUPAC symbol for any base. However, there does not seem to be a detailed comparison readily available.
For details see the comparison table.
The following are links to the b37 reference: * FASTA * FASTA Index * Sequence Dictionary
The humanG1Kv37 (human_g1k_v37.fasta, MD5sum: 0ce84c872fc0072a885926823dcd0338) reference is equivalent to b37, except that it does not contain the decoy sequence for human herpesvirus 4 type 1 (named NC_007605). This reference grew out of the 1000 Genomes Project.
For details see the comparison table.
The following are links to the humanG1Kv37 reference: * FASTA
The specific differences between these four references are detailed in the following table.
Each row lists the contigs that have identical MD5sums. Where the MD5sum does not match between the references (indicating a sequence difference), the row has a blank entry for that contig (----).
Primary contigs with differing MD5sums are highlighted in red. Alternate contigs with differing MD5sums are highlighted in orange.
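A per-contig comparison of this kind can be approximated locally. The sketch below assumes uncompressed FASTA files and assumes the per-contig MD5sum follows the convention used for the M5 tag in sequence dictionaries (MD5 of the uppercased sequence with whitespace removed); whether the table was produced exactly this way is not stated on this page. The function names `contig_md5s` and `shared_sequences` are illustrative.

```python
import hashlib


def contig_md5s(fasta_path):
    """Map each contig name to the MD5 of its uppercased, whitespace-free sequence."""
    checksums = {}
    name, digest = None, None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Close out the previous contig before starting a new one.
                if name is not None:
                    checksums[name] = digest.hexdigest()
                # The contig name is the first whitespace-delimited header token.
                name = (line[1:].split() or [""])[0]
                digest = hashlib.md5()
            elif name is not None:
                digest.update("".join(line.split()).upper().encode("ascii"))
    if name is not None:
        checksums[name] = digest.hexdigest()
    return checksums


def shared_sequences(md5s_a, md5s_b):
    """For each contig in A, list the contigs in B with an identical sequence MD5."""
    names_by_md5 = {}
    for name, md5 in md5s_b.items():
        names_by_md5.setdefault(md5, []).append(name)
    return {name: names_by_md5.get(md5, []) for name, md5 in md5s_a.items()}
```

Matching on the checksum rather than the contig name means a renamed contig (for example `chr1` in one file and `1` in another) is still paired when the sequences agree. An empty list corresponds to a blank entry in the table.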
This table indicates that while most contigs contain the same data, there are several with sequence differences between the references. Among those are Chromosome 3, Chromosome Y, and the Mitochondrial Contig.
Anecdotally, the changes affect bases for which there was low confidence, with those low-confidence bases masked with N, the IUPAC symbol for any base. However, there does not seem to be a detailed comparison readily available (i.e. there is no proof that this is true).
Therefore, when doing comparisons across the four reference versions for each of these contigs, some care should be taken.
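One way to check the masking explanation for a given contig would be to compare the two versions position by position and count how many differences involve an N. This is a minimal sketch, assuming both sequences have already been loaded as strings of equal length; the function name `classify_differences` is illustrative, and a length mismatch (which would point to an insertion, deletion, or a different source sequence) is reported as an error rather than aligned.

```python
def classify_differences(seq_a, seq_b):
    """Count differing positions, split into N-masked ones and all others."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences differ in length; positional comparison is not valid")
    masked = 0
    other = 0
    for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
        if base_a == base_b:
            continue
        if base_a == "N" or base_b == "N":
            # One reference has a concrete base where the other has N.
            masked += 1
        else:
            # Two different concrete bases (or other IUPAC codes).
            other += 1
    return {"masked": masked, "other": other}
```

If `other` is zero for a contig, its differences are consistent with masking alone. A nonzero `other` count would indicate substitutions that masking does not explain.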
Some further details can be found on this DNAnexus wiki page as of 2019/01/30.
Updated on 2019-08-20
From stachyra on 2019-06-18
If possible, could you please post links to files that actually have these MD5 checksums? In the past, I’ve tried downloading versions of GRCh37 from what I would typically consider trusted sources, only to come up disappointed. For example, this version of GRCh37 from Ensembl (ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz) while technically compliant with the .fasta file standard, is lexicographically sorted (i.e., 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, MT, X, Y, etc.) as opposed to logically sorted (1, 2, 3, …, 21, 22, X, Y, MT, etc.). While lexicographic sorting isn’t necessarily wrong, I do consider it undesirable and kind of silly, not to mention it tends to trip up GATK tools in instances where the tools require a well-defined sort order. It would be nice to know which versions of these files the Broad data science team recognizes as their own internal standards.
From jonn on 2019-08-20
Sure – I’ll try to round up my sources.
From jonn on 2019-08-20
As an addendum – while some teams at the Broad use a few different references, the ones we most commonly use are B37 and HG38 (not listed here).
The link to B37 above is now correct (and hosted from our Google bucket).