created by shlee
on 2016-06-22
This document defines several components of a reference genome. We use the human GRCh38/hg38 assembly to illustrate.
GRCh38/hg38 is the assembly of the human genome released December of 2013, that uses alternate or ALT contigs to represent common complex variation, including HLA loci. Alternate contigs are also present in past assemblies but not to the extent we see with GRCh38. Much of the improvements in GRCh38 are the result of other genome sequencing and analysis projects, including the 1000 Genomes Project.
The ideogram is from the Genome Reference Consortium website and showcases GRCh38.p7. The zoomed region illustrates how regions in blue are full of Ns.
Analysis set reference genomes have special features to accommodate sequence read alignment. This type of genome reference can differ from the reference you use to browse the genome.
_alt
suffix.chr1
–chr22
), X (chrX
), Y (chrY
) and Mitochondrial (chrM
). (ii) Unlocalized sequence are on a specific chromosome but with unknown order or orientation. Identify by _random
suffix. (iii) Unplaced sequence are on an unknown chromosome. Identify by chrU_
prefix.Within GATK documentation, Tutorial#8017 outlines how to map reads in an alternate contig aware manner and discusses some of the implications of mapping reads to reference genomes with alternate contigs.
GATK tools allow for use of a genomic intervals list that tells tools which regions of the genome the tools should act on. Judicious use of an intervals list, e.g. one that excludes regions of Ns and low complexity repeat regions in the genome, makes processes more efficient. This brings us to the next point.
Specifying contigs with colons in their names, as occurs for new contigs in GRCh38, requires special handling for GATK versions prior to v3.6. Please use the following workaround.
HLA-A*01:01:01:01
is a new contig in GRCh38. The colons are a new feature of contig naming for GRCh38 from prior assemblies. This has implications for using the -L
option of GATK as the option also uses the colon as a delimiter to distinguish between contig and genomic coordinates.-L chr1:1-100
. This also works for our HLA contig, e.g. -L HLA-A*01:01:01:01:1-100
.:1+
to the end of the chromosome name as shown below. This ensures that portions of the contig name are appropriately identified as part of the contig name and not genomic coordinates.Because CRAM compression depends on the alignment reference genome, tools that use CRAM files ensure correct decompression by comparing reference contig MD5 hashtag values. These are sensitive to any changes in the sequence, e.g. masking with Ns. This can have implications for viewing alignments in genome browsers when there is a disjoint between the reference that is loaded in the browser and the reference that was used in alignment. If you are using a version of tools for which this is an issue, be sure to load the original analysis set reference genome to view the CRAM alignments.
Yes you should. In addition to adding many alternate contigs, GRCh38 corrects thousands of SNPs and indels in the GRCh37 assembly that are absent in the population and are likely sequencing artifacts. It also includes synthetic centromeric sequence and updates non-nuclear genomic sequence.
The ability to recognize alternate haplotypes for loci is a drastic improvement that GRCh38 makes possible. Going forward, expanding genomics data will help identify variants for alternate haplotypes, improve existing and add additional alternate haplotypes and give us a better accounting of alternate haplotypes within populations. We are already seeing improvements and additions in the patch releases to reference genomes, e.g. the seven minor releases of GRCh38 available at the time of this writing.
Note that variants produced by alternate haplotypes when they are represented on the primary assembly may or may not be present in data resources, e.g. dbSNP. This could have varying degrees of impact, including negligible, for any process that relies on known variant sites. Consider the impact this discrepant coverage in data resources may have for your research aims and weigh this against the impact of missing variants because their sequence context is unaccounted for in previous assemblies.
New 11/16/2016
For a brief history and discussion on challenges in using GRCh38, see the 2015 Genome Biology article Extending reference assembly models by Church et al. (DOI: 10.1186/s13059-015-0587-3).ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/README.txt
and ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/README.txt
, respectively. In addition, the site provides annotation files, e.g. here is the annotation database for GRCh38. Within this particular page, the file named gap.txt.gz catalogues the gapped regions of the assembly full of Ns. For our illustration above, the corresponding region in this file shows:Updated on 2020-01-31
From jingmeng on 2017-01-07
Thanks very much for the post. It is said that in the analysis set of GRCh38, duplicate copies of centromeric and genomic repeat arrays (on chromosomes 5, 14, 19, 21, & 22) and two PAR regions on chromosome Y are hard-masked. I am not sure of the meaning of duplicate copies of repeat arrays. I know a repeat array is that some nucleotides are repeated several times, for example, ATTCGGATTCGGATTCGG (ATTCGG is repeated three times). Then what are duplicate copies for this case? Could you please provide some information (coordinate and nucleotide) of the hard-masked regions in the analysis set? And could you please explain to me why hard-masking these regions can result optimal mapping? For variant calling on the repeat and low complexity regions, as the repeat regions are hard-masked and ignored, no variant is predicted in these regions and it will result in sensitivity decreasing. Is my guess right? Look forward to hearing from you.
From shlee on 2017-01-09
Hi @jingmeng,
We’ll get to your question soon.
From shlee on 2017-01-13
Hi @jingmeng,
Yes, indeed I do phrase it as you say and I agree this could be worded better as the current wording is confusing, especially given folks study actual copy number variants and this is not what I mean.
> For example, the GRCh38 analysis set hard-masks, i.e. replaces with Ns, duplicate copies of centromeric and genomic repeat arrays (on chromosomes 5, 14, 19, 21, & 22) and two PAR regions on chromosome Y.
To be more precise, here’s what I’ve come up with and I hope this clarifies your questions.
> For example, the GRCh38 analysis set hard-masks, i.e. replaces with Ns, a proportion of homologous centromeric and genomic repeat arrays (on chromosomes 5, 14, 19, 21, & 22) and two PAR (pseudoautosomal) regions on chromosome Y.
As for what types of repeat arrays the genome masks, this is more detail than our current user base needs. However, if you find this is important to current genomics research, please let us know. For now, you can read up on general repeat array types in [this wikipedia entry](https://en.wikipedia.org/wiki/Repeated_sequence_(DNA)) and satellite arrays in [this wiki entry](https://en.wikipedia.org/wiki/Satellite_DNA). It’s my understanding, and perhaps someone in the community can clarify this, that masking these regions allows for better short read mapping to the reference. It’s my impression that those in the field have come to this conclusion empirically. If understanding this further is important to your research, then I can ask those in the know here and get back to you in a week or two.
As for variant calling on repeat and low complexity regions, well, with current technology this seems (in my opinion) an exercise in futility. I fathom such regions are subject to high rates of homologous recombination and polymerase slippage so I suspect variability between people will be high. Also, how can we effectively, by orthogonal methods, confirm the validity of the sequence in such regions? In other words, how can we determine the truth to verify our short-read sequence alignments? The assumptions we use for 4 nucleotides and paired short read sequences cannot apply equally well to regions that only use 1-3 nucleotides. Jingmeng, how much would you trust any variation reported for such regions with the current sequencing technology?
As for the actual coordinates of the masked repeat arrays, I refer you to reference 5 of the external resources. The [gap.txt.gz](http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/gap.txt.gz) file tells you exactly which regions are gaps, i.e. hard-masked by Ns. Also, [this GRC page](https://www.ncbi.nlm.nih.gov/grc/human/data) has a subtab called Gaps. If you click on it, they show two different types of gaps, spanned gaps and unspanned gaps, which they then define further and show metrics for within GRCh38. You can click on column headers to sort, as I show in the attached screenshots.
I hope this is helpful and thanks for the feedback. Cheers.
Screenshot 2017-01-13 11.58.45.png
Screenshot 2017-01-13 12.00.36.png
From mmokrejs on 2017-01-31
> This means that if we align sequence reads to GRCh38+ALT blindly, then we obtain many multi-mapping reads with zero mapping
> quality. Since many GATK tools have a ZeroMappingQuality filter, we will then miss variants corresponding to such loci.
I am just guessing what you describe but I also realized I have lost valid read pairs because one read was mapped to say chr11 and the other mate to chr11__XXXX_alt. My ‘bwa mem’ run was likely not aware of me serving it alternate contigs and discarded the read pair during mate-pair distance check. The reads were assigned mapQ=0. I believe this is what you are talking about. However, I think it is a consequence of me running ‘bwa mem’ in a wrong way. Read further below.
> Recent releases of BWA, e.g. v0.7.15+, handle alt contig mapping and HLA typing. See the BWA repository for information.
> See these pages for download and installation instructions.
Unfortunately the bwa docs do not explain How to make bwa ALT-aware. Below is the way how I call ‘bwa mem’.
bwa mem -t 16 -p -M -R ‘@RG\tID:XXXX-PB\tPL:ILLUMINA\tLB:S04380110 Agilent SureSelect Human All Exon V5\tSM:XXXX-PB’ ftp.broadinstitute.org/bundle/hg38/
hg38bundle/Homo_sapiens_assembly38.fasta.gz dedup.realigned.pairs.fastq | \
samtools view -@ 15 -T ftp.broadinstitute.org/bundle/hg38/hg38bundle/Homo_sapiens_assembly38.fasta.gz -Sb -o XXXX-PB.bwa.pairs.bam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
I believe the above “read 0 ALT contigs” means my bwa needed some additional *.alt file along the indexed reference which is not present in hg38bundle and I did not “just cherry-pick?” the file on my own inside the hg38bundle/ directory .
bwa.kit/run-gen-ref hs38DH
See:
https://sourceforge.net/p/bio-bwa/mailman/bio-bwa-help/thread/1516D2DE-59F1-4803-AD5A-6A901F59C076%40me.com/#msg33179172
https://github.com/lh3/bwa/tree/master/bwakit
https://github.com/lh3/bwa/blob/master/bwakit/run-gen-ref
From mmokrejs on 2017-02-01
$ bwa.kit/run-gen-ref hs38DH
[bwa_index] Pack FASTA… 36.89 sec
[bwa_index] Construct BWT for the packed sequence…
[BWTIncCreate] textLength=6434693834, availableWord=464768632
…
[bwt_gen] Finished constructing BWT in 711 iterations.
[bwa_index] 4178.04 seconds elapse.
[bwa_index] Update BWT… 26.76 sec
[bwa_index] Pack forward-only FASTA… 26.02 sec
[bwa_index] Construct SA from BWT and Occ… 1381.03 sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa index hs38DH.fa
[main] Real time: 5648.259 sec; CPU: 5647.889 sec
…
-rw———- 1 mmokrejs mmokrejs 487553 Jan 31 22:33 hs38DH.fa.alt
-rw———- 1 mmokrejs mmokrejs 3263690197 Jan 31 22:33 hs38DH.fa
-rw———- 1 mmokrejs mmokrejs 3217347004 Jan 31 23:44 hs38DH.fa.bwt
-rw———- 1 mmokrejs mmokrejs 804336731 Jan 31 23:44 hs38DH.fa.pac
-rw———- 1 mmokrejs mmokrejs 455474 Jan 31 23:44 hs38DH.fa.ann
-rw———- 1 mmokrejs mmokrejs 20199 Jan 31 23:44 hs38DH.fa.amb
-rw———- 1 mmokrejs mmokrejs 1608673512 Feb 1 00:07 hs38DH.fa.sa
$
Basically, it seems I could have just copied&renamed hs38DH.fa.alt from bwa.kit/resource-GRCh38/ into the hg38bundle/ directory and re-created ‘bwa index’.
From shlee on 2017-02-01
Hi @mmokrejs,
Please check out [Tutorial#8017](http://gatkforums.broadinstitute.org/gatk/discussion/8017/how-to-map-reads-to-a-reference-with-alternate-contigs-like-grch38). It’s a long tutorial with details that address many of your questions in the posts above. There are links within the document to the GRCh38 resources that alt-aware mapping requires. After reading the tutorial, if you have specific questions on the contents of the tutorial, then please let us know.
From shlee on 2017-03-08
- Picard has a tool, [ScatterIntervalsByNs](https://broadinstitute.github.io/picard/command-line-overview.html#ScatterIntervalsByNs), that can help you create intervals between gapped regions.
- The tool to use to create an intersection of intervals is Picard’s [IntervalListTools](https://broadinstitute.github.io/picard/command-line-overview.html#IntervalListTools).
From oskarv on 2017-04-21
> @shlee said:
> – Picard has a tool, [ScatterIntervalsByNs](https://broadinstitute.github.io/picard/command-line-overview.html#ScatterIntervalsByNs), that can help you create intervals between gapped regions.
> – The tool to use to create an intersection of intervals is Picard’s [IntervalListTools](https://broadinstitute.github.io/picard/command-line-overview.html#IntervalListTools).
I downloaded the GRCh38-p10 version from ftp://ftp.ensembl.org/pub/release-88/fasta/homo_sapiens/dna/ and tried to create an interval list for scatter gather use, but it doesn’t work.
Here’s the command and the error message:
```
java -jar picard.jar ScatterIntervalsByNs R=Homo_sapiens.GRCh38.dna.primary_assembly.fa O=output.interval_list
[Fri Apr 21 15:33:17 CEST 2017] picard.util.ScatterIntervalsByNs REFERENCE=Homo_sapiens.GRCh38.dna.primary_assembly.fa OUTPUT=output.interval_list OUTPUT_TYPE=BOTH MAX_TO_MERGE=1 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Apr 21 15:33:17 CEST 2017] Executing as user@computer on Linux 4.4.0-70-generic amd64; Java HotSpot™ 64-Bit Server VM 1.8.0_121-b13; Picard version: 2.5.0(2c370988aefe41f579920c8a6a678a201c5261c1_1466708365)
[Fri Apr 21 15:33:17 CEST 2017] picard.util.ScatterIntervalsByNs done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=251658240
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread “main” java.lang.NullPointerException at picard.util.ScatterIntervalsByNs.segregateReference(ScatterIntervalsByNs.java:141) at picard.util.ScatterIntervalsByNs.doWork(ScatterIntervalsByNs.java:107) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
```
I tried with picard 2.9 too but it gave the same error message. The reference file on your ftp works as well as this one: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/GRCh38.p10.genome.fa.gz, but the one I need to use is the absolute latest one. What could be wrong with it?
From shlee on 2017-04-26
Hi @oskarv,
Both `GRCh38.p10.genome.fa` (Sanger) and `Homo_sapiens.GRCh38.dna.primary_assembly.fa` (ENSEMBL) files you give work fine in my hands with ScatterIntervalsByNs. Be sure to create fresh index (samtools faidx) and dictionary (Picard CreateSequenceDictionary) files before running ScatterIntervalsByNs.
One thing I notice is that the two FASTAs annotate and sort contigs differently:
```
WMCF9-CB5:~ shlee$ gzcat ~/Downloads/GRCh38.p10.genome.fa.gz | grep ‘>‘
>chr1 1
>chr2 2
```
```
WMCF9-CB5:~ shlee$ gzcat ~/Downloads/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz | grep ‘>‘
>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
>10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF
```
After running the FASTA through CreateSequenceDictionary using the tool’s default settings, this translates to each header as follows:
```
HD VN:1.5
SQ SN:chr1 LN:248956422 M5:2648ae1bacce4ec4b6cf337dcae37816 UR:file:/Users/shlee/Downloads/GRCh38.p10.genome.fa.gz
@SQ SN:chr2 LN:242193529 M5:4bb4f82880a14111eb7327169ffb729b UR:file:/Users/shlee/Downloads/GRCh38.p10.genome.fa.gz
```
```
HD VN:1.5
SQ SN:1 LN:248956422 M5:2648ae1bacce4ec4b6cf337dcae37816 UR:file:/Users/shlee/Downloads/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
@SQ SN:10 LN:133797422 M5:907112d17fcb73bcab1ed1c72b97ce68 UR:file:/Users/shlee/Downloads/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
```
Be aware of this difference in ordering and nomenclature.
From shlee on 2017-05-01
Hmm, there appear to be some odd Markdown format interpretations going on with our forum site and elements in codeblocks appear unprotected.
Should be:
Screenshot 2017-05-01 16.35.02.png
Screenshot 2017-05-01 16.35.13.png
From zhaoqiang90 on 2017-07-18
Hi @shlee,
I have some question about human reference genome. What is patch version of GRCh38 in GATK Resource Bundle? The annotation database(such as NCBI GRCh38.p7.RefSeq and dbsnp) are mainly for GRCh38.p7, so is OK if I use Resource Bundle reference for alignment and GRCh38.p7 for annotation? Can this cause some error in snp calling?
From shlee on 2017-07-18
Hi @zhaoqiang90,
As far as I know, the GATK Resource Bundle provides only major releases, without any patch updates. Let me get back to you about your second question.
From shlee on 2017-07-18
@zhaoqiang90,
The subsetting of contigs works only one way. It should work for your case where every one of your variant contigs is represented in the resource. The converse may not work however. If the tool encounters a variant on a contig that the resource does not represent, I believe it will error.
Currently, VariantAnnotator is available only in GATK3. We have plans to port it to GATK4 and it is slated to be available late this year.
From zhaoqiang90 on 2017-07-21
Hi @shlee
I still have some doubts.
1. If I use GRCh38.p7 as my reference genome, is it OK to use the file in GATK resource bundle as my training set at GATK VQSR? If it’s not OK, is that means that I must use the reference genome GATK resource bundle?
2. If I use GATK resource bundle’s human reference genome. In the annotation step(usually SnpEff or Annovar), if I use a annotation database which(for example) support for GRCh38.9p7, it this will cause some error in my final result because of minor version’s difference? Or I should the database that support GRCh38 major release?
3. As mentioned in GATK page, Exome files and itemized resource list coming soon(ish) for GRCh38. If I don’t have these file to assist my WES analysis, would this caused some error in the final result?
From shlee on 2017-07-21
@zhaoqiang90,
The developers tell me that unfortunately, we cannot predict how each tool will handle mismatching contig header lines for input files. So you will have to see how it goes with your analysis. As to the scientific validity of mixing references, this is a question you must answer for yourself.
I think the bigger issue to consider is whether your alignment settings handle patched information correctly.
> (i) Fix patches represent sequences that will replace primary assembly sequence in the next major assembly release. When interpreting data, fix patches should take precedence over the chromosomes.
> (ii) Novel patches represent alternate loci. When interpreting data, treat novel patches as population sequence variants.
If a read aligns to a fixed patch better than the primary assembly, which alignment representation does the aligner prioritize? Could the two possible alignments end up with MAPQ 0 and secondary alignment flags? If this is the case, then you get less information for the locus than if you had used just the major release of the reference because our callers ignore secondary alignments. If one of the alignments ends up a supplementary alignment, are the mapping qualities distributed in such a way that allows our callers to use the alignment (e.g. HCMappingQualityFilter)? Did the mate align to the other contig, e.g. one mate on the patch contig and the other mate on the primary assembly? Our workflows, without additional tweaking, currently do not call on reads with such criss-crossed mates.
Unless there is a strong reason for your research to require the patches, e.g. a specific patch contig representing a drastically altered haplotype important to your sample individuals, and you are equip to deal with the vagaries of using patches, e.g. adding these to the alt contig index file for alt-aware BWA handling or some additional post-alignment-processing, then I think you are better off using the major release version for our workflows.
As for your questions:
1. I think it should be okay to train VQSR with the resource bundle file.
2. I’m not familiar with SnpEff nor Annovar as they are not GATK tools and so I cannot comment on your question. At what step do you annotate, what do you annotate and what are the annotations used for?
3. You should be using exome target intervals provided by your exome kit provider.
From markw on 2017-12-11
These definitions may be helpful for users looking for genomes sequences from RefSeq:
RefSeq category – shown if the assembly is a reference or representative genome in the NCBI Reference Sequence (RefSeq) project classification:
Source: [https://www.ncbi.nlm.nih.gov/assembly/help/#glossary](https://www.ncbi.nlm.nih.gov/assembly/help/#glossary “https://www.ncbi.nlm.nih.gov/assembly/help/#glossary”)
From jacobhsu on 2019-05-13
Hi,
I’m a bit confused. What are the differences between
1) Homo_sapiens_assembly38.fasta in GATK bundle, and
2) The file from Heng Li’s suggestions (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz).
Are they identical? Both of them are the major release of hg38, right?
From songofbin on 2019-08-08
Hi @jacobhsu, I think that’s same with GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna.gz.
From jacobhsu on 2019-12-06
Hi @songofbin
I have done some header comparisons. They are at least two differences.
1) Start from contig chrUn_KN707606v1_decoy, the r1 header in GATK bundle hg38 shows “unplaced”, but the one from NCBI (GCA_000001405.15_GRCh38_full_analysis_set.fna.gz) shows “decoy”.
2) NCBI (GCA_000001405.15_GRCh38_full_analysis_set.fna.gz) doesn’t contain HLA alternative contig.
However, GATK bundle hg38 is as same as from 1000G [ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/] except r1 header.
Not sure will r1 header affect any downstream analyses?
Please note that I only compare headers in each file.