Haplotype map format

IMPORTANT: This is the legacy GATK documentation. This information is only valid until Dec 31st 2019. For latest documentation and forum click here

created by GATK_Team

on 2017-12-24

Some Picard tools require a haplotype map that maps SNPs to LD (linkage disequilibrium) blocks. These tools include CrosscheckReadGroupFingerprints and CheckFingerprint. You can find a poster about fingerprinting here.

For these tools, the HAPLOTYPE_MAP parameter is used to specify the file. There are two acceptable formats for this file: a plain text-based file with tab-separated fields, and VCF (supported extensions: .vcf, .vcf.gz or .bcf), following the requirements outlined below.

The original haplotype map file format

A file in the original format has two parts: a header and a body.

The header is a standard SAM header, with an @HD line to define the file type and @SQ lines to define the reference contigs. You can easily derive such a header from your reference dictionary file.
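
Because a sequence dictionary (.dict) file is itself a SAM-style header, deriving the haplotype map header can be as simple as copying its @HD and @SQ lines. The Python sketch below illustrates this on an in-memory string. The function name header_from_dict, the toy contigs and the @PG line are hypothetical, and the code is not part of Picard.

```python
def header_from_dict(dict_text: str) -> str:
    """Keep only the @HD and @SQ lines of a sequence dictionary."""
    kept = [
        line
        for line in dict_text.splitlines()
        if line.startswith("@HD") or line.startswith("@SQ")
    ]
    return "\n".join(kept) + "\n"


# Toy dictionary content (made up for illustration).
example_dict = (
    "@HD\tVN:1.5\tSO:unsorted\n"
    "@SQ\tSN:chrA\tLN:1000\n"
    "@SQ\tSN:chrB\tLN:2000\n"
    "@PG\tID:hypothetical\n"
)
header = header_from_dict(example_dict)
```

The resulting text can then be placed at the top of the haplotype map file, ahead of the body.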

The body contains a column header line starting with a # hash followed by lines that annotate SNPs and blocks in high LD.

    • NAME is a SNP identifier, e.g. dbSNP rsID.
    • MAF is the minor allele frequency.
    • ANCHOR_SNP refers to the NAME of a SNP that groups SNPs in high LD with each other. The tool counts all of the SNPs with the same ANCHOR_SNP as one group.
    • The column header line must include the PANELS label, but the value in the PANELS column is optional.

SNPs listed with the same ANCHOR_SNP belong to the same haplotype block. If the MAFs within a block disagree, the tool uses the MAF of the first SNP, i.e. the one with the smallest genomic position, as the MAF of the block.
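
The grouping rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Picard's implementation: the full column layout in the example (CHROMOSOME, POSITION, MAJOR_ALLELE, MINOR_ALLELE and the anchor column spelled ANCHOR_SNP) is assumed rather than taken from the text above, the rs_a/rs_b/rs_c records are invented, and a SNP with an empty anchor is assumed to form its own block.

```python
from collections import defaultdict


def parse_body(text):
    """Parse the tab-separated body; the '#' line supplies the column names."""
    columns = None
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        if line.startswith("#"):
            columns = line[1:].split("\t")
            continue
        if columns is None:
            raise ValueError("data line found before the '#' column header")
        fields = line.split("\t")
        # The PANELS value may be missing, so pad short rows with empty strings.
        fields += [""] * (len(columns) - len(fields))
        rows.append(dict(zip(columns, fields)))
    return rows


def block_mafs(rows):
    """Group SNPs by anchor and take each block's MAF from its first SNP."""
    blocks = defaultdict(list)
    for row in rows:
        # Assumption: an empty anchor means the SNP is a block of its own.
        anchor = row["ANCHOR_SNP"] or row["NAME"]
        blocks[anchor].append(row)
    result = {}
    for anchor, members in blocks.items():
        # "First" means the smallest genomic position within the block.
        first = min(members, key=lambda r: int(r["POSITION"]))
        result[anchor] = float(first["MAF"])
    return result


# Invented example: rs_a and rs_b share an anchor but disagree on MAF.
EXAMPLE = (
    "@HD\tVN:1.5\n"
    "@SQ\tSN:chrA\tLN:1000\n"
    "#CHROMOSOME\tPOSITION\tNAME\tMAJOR_ALLELE\tMINOR_ALLELE\tMAF\tANCHOR_SNP\tPANELS\n"
    "chrA\t300\trs_b\tC\tT\t0.30\trs_a\t\n"
    "chrA\t100\trs_a\tA\tG\t0.25\trs_a\t\n"
    "chrA\t700\trs_c\tG\tA\t0.10\t\t\n"
)
mafs = block_mafs(parse_body(EXAMPLE))
```

Here the block anchored on rs_a takes the MAF of the SNP at position 100, even though the SNP at position 300 appears first in the file.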

The VCF-based haplotype map

Starting with Picard v2.10.1 (released 2017/7/11), tools recognize the VCF format if the file name ends in .vcf, .vcf.gz or .bcf. Tools interpret any other file extension as the original text-based format described above.
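
The extension rule amounts to a suffix check, sketched here in Python. The function name haplotype_map_format is hypothetical, and the sketch assumes the comparison is case-sensitive, which the text above does not state.

```python
VCF_EXTENSIONS = (".vcf", ".vcf.gz", ".bcf")


def haplotype_map_format(filename: str) -> str:
    """Return 'vcf' for the listed extensions, otherwise 'text'."""
    # str.endswith accepts a tuple and matches any of its suffixes.
    return "vcf" if filename.endswith(VCF_EXTENSIONS) else "text"
```

For example, haplotype_map_format("map.vcf.gz") gives "vcf", while a name such as "map.txt" falls through to "text".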

Click here to download an example file. Here is the body portion of this example file.

    • The VCF format haplotype map contains exactly one sample whose genotype calls are all heterozygous, e.g. 0/1 or 0|1.
    • The tool determines haplotype block grouping using phased genotypes (with a pipe |) and the PS (phase set) format field annotation.
    • The INFO field's AF annotation is the alternate allele frequency, which is not necessarily the minor allele frequency. This differs from the original haplotype map file format, which records the MAF.
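
The three points above can be illustrated with a small pure-Python sketch that reads VCF body lines, checks that each genotype is heterozygous, and groups phased records by chromosome and PS. It is a sketch under assumptions, not Picard's code: the records and the sample name are invented, an unphased genotype is assumed to form a block of its own, and the AF-to-MAF conversion min(AF, 1 - AF) holds only for biallelic sites.

```python
from collections import defaultdict


def minor_allele_frequency(af: float) -> float:
    """For a biallelic site, the minor allele is the rarer of ALT and REF."""
    return min(af, 1.0 - af)


def group_by_phase_set(vcf_text):
    """Map (chromosome, phase set) to a list of (ID, MAF) pairs."""
    blocks = defaultdict(list)
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.split("\t")
        chrom, pos, snp_id = fields[0], fields[1], fields[2]
        info, fmt, sample = fields[7], fields[8], fields[9]
        values = dict(zip(fmt.split(":"), sample.split(":")))
        gt = values["GT"]
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) != 2 or alleles[0] == alleles[1]:
            raise ValueError(f"{snp_id}: genotype {gt} is not heterozygous")
        info_map = dict(
            item.split("=", 1) for item in info.split(";") if "=" in item
        )
        maf = minor_allele_frequency(float(info_map["AF"]))
        if "|" in gt:
            # Phased: records sharing a chromosome and PS value share a block.
            key = (chrom, values.get("PS"))
        else:
            # Assumption: an unphased genotype is treated as a singleton block.
            key = (chrom, f"unphased:{pos}")
        blocks[key].append((snp_id, maf))
    return dict(blocks)


# Invented records; HYPOTHETICAL_SAMPLE is a placeholder sample name.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHYPOTHETICAL_SAMPLE\n"
    "chrA\t100\trs_a\tA\tG\t.\t.\tAF=0.75\tGT:PS\t0|1:100\n"
    "chrA\t300\trs_b\tC\tT\t.\t.\tAF=0.30\tGT:PS\t1|0:100\n"
    "chrA\t700\trs_c\tG\tA\t.\t.\tAF=0.10\tGT\t0/1\n"
)
phase_blocks = group_by_phase_set(EXAMPLE_VCF)
```

In this example rs_a has AF=0.75, so its alternate allele is the major allele and the corresponding minor allele frequency is 0.25.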

Finally, the VCF specification (v4.2) defines the PS field as follows:

PS : phase set. A phase set is defined as a set of phased genotypes to which this genotype belongs. Phased genotypes for an individual that are on the same chromosome and have the same PS value are in the same phased set. A phase set specifies multi-marker haplotypes for the phased genotypes in the set. All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set. If the genotype in the GT field is unphased, the corresponding PS field is ignored. The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required). (Non-negative 32-bit Integer)