Reference genome

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by GATK_Team

on 2017-12-24

This document covers the general motivation behind the use of genome references, as well as some terminology and related considerations. For more specific information about human genome reference assemblies, please see the Dictionary entry on Human genome reference builds. For help dealing with reference compatibility problems, see this Solutions doc. For information on the FASTA format and accompanying index files, see the Dictionary entry on FASTA.

1. Background
2. Choosing a reference genome build
3. Nomenclature: words we use to describe components of reference genomes
4. Recommended genome browser: IGV

1. Background

Consider this a central dogma of GATK: all genome analyses are (or should be) done relative to a common reference sequence.

Why? Let's look at a similar, if simpler, problem. We have three modern-day sentences that we know evolved from a common ancestor:

The quick brown fax jumped over the lazy doge.
The quick _ fox jumps over the lazy doge.
The quick brown fox jumps over the lazy brown dog.

We'd like to inventory their differences in a way that is not biased toward any single one of them, and is robust to the possibility of adding new mutant sentences as we encounter them. So we create a synthetic hybrid that encapsulates what they have most in common, yielding:

The quick brown fox jumped over the lazy doge.

We can use this as a common reference coordinate system against which we can plot what is different (if not necessarily unique) in each mutant:

Fourth word, o->a substitution.
Third word deleted; fifth word, ed->s substitution.
Fifth word, ed->s substitution; duplication of the third word located after the eighth word; ninth word, deletion of "e".
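
The same bookkeeping can be sketched in Python. This is only an illustration, assuming a word-level comparison with the standard library's difflib; the function name diff_against_reference is made up for this example. Positions are counted along the reference sentence, so every mutant is described in the same coordinate system.

```python
from difflib import SequenceMatcher


def diff_against_reference(reference, sample):
    """List word-level differences of `sample`, in reference coordinates.

    Hypothetical helper for illustration only. Returns tuples of
    (1-based reference word position, operation, reference words, sample words).
    """
    ref_words = reference.rstrip(".").split()
    sample_words = sample.rstrip(".").split()
    matcher = SequenceMatcher(a=ref_words, b=sample_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical stretches carry no information
        changes.append(
            (i1 + 1, tag, " ".join(ref_words[i1:i2]), " ".join(sample_words[j1:j2]))
        )
    return changes


REFERENCE = "The quick brown fox jumped over the lazy doge."
MUTANTS = [
    "The quick brown fax jumped over the lazy doge.",
    # The "_" placeholder for the missing word is simply left out here.
    "The quick fox jumps over the lazy doge.",
    "The quick brown fox jumps over the lazy brown dog.",
]

if __name__ == "__main__":
    for mutant in MUTANTS:
        print(mutant)
        for change in diff_against_reference(REFERENCE, mutant):
            print("   ", change)
```

A word-level diff is coarser than a description written by hand: it reports a whole word as replaced rather than a single letter, and it cannot tell a duplication from an ordinary insertion. The point is only that each mutant is compared to the reference and never to the other mutants.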

It's obviously not a perfect method, and what it gives us is not the ancestral sentence: we suspect that's not how "dog" was originally spelled, and we're unsure of the original tense (jumps vs. jumped). But it enables us to distinguish what's "normal" (in the sense that it's the norm in the population we have access to) from what's divergent.

The more sentences we can involve in the initial formulation of this reference, and the more representative the sampling, the more appropriate it will be for describing the variations we encounter in the future.

That is exactly what we are doing when we use a reference genome: rather than attempting to chart differences between genome sequences relative to each other, which gets horribly complicated as soon as we involve more than two sequences, we chart them relative to a common standard. At that point it becomes much more tractable (if not completely trivial) to identify what subset of variations in sequence are commonly observed vs. unique to samples, individuals or sets thereof.

So whose genome do we use as the common standard? No one's, and hopefully everyone's. In the simplest case, any individual genome can be used as a reference genome. However, the quality and sensitivity of analysis are increased when the reference genome is more representative of the widest group of individuals we might want to study. So each segment of the genome reference should feature the sequence most commonly observed across available individual genomes. The resulting reference genome is therefore a synthetic hybrid that serves as an archetype but whose sequence is not actually observed wholesale in any particular individual genome.
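
To make the "most commonly observed" idea concrete, here is a minimal Python sketch. It assumes the input sequences are already aligned and of equal length, and it takes a simple per-position majority vote. Real reference assemblies are not built this way; the function name and the toy sequences are invented for illustration.

```python
from collections import Counter


def majority_consensus(aligned_sequences):
    """Return the most common base at each position (hypothetical helper).

    Assumes the sequences are pre-aligned and of equal length. Ties are
    broken by whichever base is seen first at that position.
    """
    lengths = {len(seq) for seq in aligned_sequences}
    if len(lengths) != 1:
        raise ValueError("sequences must be pre-aligned to the same length")
    consensus = []
    for column in zip(*aligned_sequences):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Toy data: each individual carries one private variant.
SAMPLES = ["ACGTTAGC", "ACCTAAGC", "ACGTAAGG", "TCGTAAGC"]

if __name__ == "__main__":
    reference = majority_consensus(SAMPLES)
    print(reference)
    print("observed wholesale in an individual:", reference in SAMPLES)
```

In this toy case the consensus matches none of the four inputs, which mirrors the point above: the reference is an archetype, not any one individual's sequence.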

As an additional twist, note that all current standard reference genome sequences are haploid, meaning they represent only one copy of each chromosome (or contig). The most immediate consequence is that in diploid organisms such as humans, which have two copies of each autosome (= any chromosome that is not a sex chromosome, X or Y), the choice of standard representation of sites that are most often observed in heterozygous state (manifesting two different alleles, e.g. A/T) is largely arbitrary. This is obviously even worse in polyploid organisms, such as many plants including wheat and strawberries, which have higher numbers of chromosome copies. While it is possible to represent reference genomes using graph-based representations, which would address this problem, few genome analysis tools are able to handle such representations at this time. See this article for further discussion.

2. Choosing a reference genome build

Whether you work with a model organism or a non-model organism, chances are that more than one reference build, or assembly, is available. In the case of the human genome this used to be a huge problem, though the advent of the latest build (GRCh38/hg38) seems to have reduced the complexity to some degree. We're not very familiar with the situation for other organisms, but anecdotal reports suggest this is a fairly common problem.

In practice, the biggest problem is that once you've started working with a particular reference build, it's difficult to switch to another or incorporate an external resource that was derived from a different build. We have an entire document here dedicated to the problems that can arise in such situations.

The best thing you can do to make life easier for your future self is, when you prepare your experimental design, choose the reference build you're going to use with great care. You should consider (1) what resources are going to be necessary and what is available for the various builds you are looking at, (2) what your colleagues or prospective collaborators already use, (3) what people in your field most frequently use, and (4) no actually that's it, it's just those three.

3. Nomenclature: words we use to describe components of reference genomes

There's a whole bunch of jargon specifically associated with reference genomes; we've tried to collect the most common terms here, but if you find any that you think we should add, just let us know in the comments.

Analysis set reference genomes have special features to accommodate sequence read alignment. This type of genome reference can differ from the reference you use to browse the genome. See the document on the Human genome reference builds for an example.
A contig is a contiguous sequence without "physical" gaps (stretches of "N" bases are not considered gaps in this context), such as a chromosome. A contig can also be a scaffold in incomplete assemblies, a plasmid in bacterial genomes, and so on.
Alternate contigs, alternate scaffolds or alternate loci allow for representation of diverging haplotypes in regions that are too complex for a single representation. See the document on the Human genome reference builds for more discussion on the purpose and usage of ALT contigs.
Primary assembly refers to the collection of (i) assembled chromosomes, (ii) unlocalized (known to belong on a specific chromosome but with unknown order or orientation) and (iii) unplaced (chromosome unknown) sequences. It represents a non-redundant haploid genome.
PAR stands for pseudoautosomal region. PARs in mammalian X and Y chromosomes allow for recombination between the sex chromosomes. Because the PAR sequences together create a diploid or pseudo-autosomal sequence region, the X and Y chromosome sequences are intentionally identical in these regions of the genome assembly. Analysis set genomes further hard-mask two of the Y chromosome PAR regions so as to allow mapping of reads solely to the X chromosome PAR regions.
Different assemblies shift coordinates for loci and are released infrequently. In the human genome context, hg19 and GRCh38/hg38 represent two different major assemblies. Comparing data from different assemblies requires lift-over tools that adjust genomic coordinates to match loci, at times imperfectly.
Patches are regional fixes that are released periodically for a given assembly. They are intended to improve representation or add information to the assembly without disrupting the chromosome coordinates. There are two types of patches, fixed and novel, representing different types of sequence changes.
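
To illustrate why lifting coordinates between assemblies can be imperfect, here is a toy Python sketch. It assumes a made-up list of aligned blocks between an "old" and a "new" assembly; the block values and the lift_over function are hypothetical. Real lift-over tools work from chain files and also have to deal with strand, multiple contigs and partially mapped intervals, none of which is modeled here.

```python
from typing import List, Optional, Tuple

# Each block is (old_start, old_end, new_start), using 0-based half-open
# intervals. These values are invented purely for illustration.
Block = Tuple[int, int, int]

TOY_BLOCKS: List[Block] = [
    (0, 1000, 0),        # unchanged stretch
    (1000, 2000, 1050),  # shifted by a 50-base insertion in the new assembly
    (2100, 3000, 2150),  # old positions 2000-2099 have no counterpart
]


def lift_over(position: int, blocks: List[Block]) -> Optional[int]:
    """Map a position from the old assembly to the new one.

    Returns None when the position falls outside every aligned block,
    i.e. the locus cannot be lifted over.
    """
    for old_start, old_end, new_start in blocks:
        if old_start <= position < old_end:
            return new_start + (position - old_start)
    return None


if __name__ == "__main__":
    for pos in (500, 1500, 2050, 2500):
        print(pos, "->", lift_over(pos, TOY_BLOCKS))
```

Positions that return None are the "imperfect" cases: a variant recorded there against the old assembly has nowhere to go in the new one.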

4. Recommended genome browser: IGV

We recommend using the Integrative Genomics Viewer (IGV) for browsing/viewing genome sequence data. IGV is a desktop application for viewing genomics data including alignments. The tool is able to use reference genomes you provide via file or URL, or one of the many that it hosts over a server. The numerous hosted reference genomes include GRCh38. See this page for information on hosted reference genomes. For the most up-to-date list of hosted genomes, open IGV and go to Genomes>Load Genome From Server. A menu lists genomes you can make available in the main genome dropdown menu.

Why do we recommend IGV specifically? There are admittedly other genome browsers out there that are fully functional and perfectly pleasant; however we have a close relationship with the developers of IGV (who originally started it at the Broad Institute down the hall from us) so it is convenient for us to keep using it. You can of course use whatever browser you wish; just be aware that all of our screenshots in documentation materials and our tutorials, both online and at onsite workshops, make exclusive use of IGV.

Viewing CRAM alignments on genome browsers

Because CRAM compression depends on the alignment reference genome, tools that use CRAM files ensure correct decompression by comparing reference contig MD5 hash values. These are sensitive to any changes in the sequence, e.g. masking with Ns. This can have implications for viewing alignments in genome browsers when there is a mismatch between the reference that is loaded in the browser and the reference that was used in alignment. If you are using a version of the tools for which this is an issue, be sure to load the original analysis set reference genome to view the CRAM alignments.
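
The sensitivity of the MD5 check can be illustrated with a short Python sketch. It assumes the checksum is computed on the contig sequence with whitespace removed and letters uppercased, which is our reading of how such reference checksums are usually defined. The contig_md5 helper and the toy sequences are made up for this example.

```python
import hashlib


def contig_md5(sequence: str) -> str:
    """MD5 of a contig sequence (hypothetical helper).

    Assumption: line breaks and other whitespace are dropped and the
    sequence is uppercased before hashing.
    """
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


ORIGINAL = "ACGTACGTACGT"
WRAPPED_LOWERCASE = "acgtac\ngtacgt\n"  # same bases, different FASTA layout
MASKED = "ACGTNNNNACGT"                 # four bases hard-masked with Ns

if __name__ == "__main__":
    print("original:", contig_md5(ORIGINAL))
    print("wrapped :", contig_md5(WRAPPED_LOWERCASE))
    print("masked  :", contig_md5(MASKED))
```

Under these assumptions, line wrapping and letter case do not change the checksum, but replacing even a few bases with Ns does. That is why a browser loaded with a differently masked copy of the "same" build can fail the check.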

Updated on 2018-01-09
