created by GATK_Team
on 2017-12-24
Some Picard tools require a haplotype map that maps SNPs to LD (linkage disequilibrium) blocks. These tools include CrosscheckReadGroupFingerprints and CheckFingerprint. You can find a poster about fingerprinting here.
For these tools, the HAPLOTYPE_MAP parameter is used to specify the file. There are two acceptable formats for this file: a plain text-based file with tab-separated fields, and VCF (supported extensions: .vcf, .vcf.gz or .bcf), following the requirements outlined below.
The text-based format has a header and a body.
The header is a standard SAM header, with an @HD line to define the file type and @SQ lines to define the reference contigs. You can easily derive such a header from your reference dictionary file.
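As a rough illustration of deriving the header, the sketch below keeps only the @HD and @SQ lines of a sequence dictionary. It assumes the dictionary is a plain-text SAM header; the function name and the example contigs are hypothetical and not part of Picard.

```python
def header_from_dict(dict_lines):
    """Keep only the @HD and @SQ lines of a sequence dictionary (hypothetical helper)."""
    return [line.rstrip("\n") for line in dict_lines
            if line.startswith(("@HD", "@SQ"))]


# Hypothetical dictionary content, for illustration only.
example_dict = [
    "@HD\tVN:1.5\tSO:unsorted\n",
    "@SQ\tSN:chr1\tLN:248956422\n",
    "@SQ\tSN:chr2\tLN:242193529\n",
    "@CO\ta comment line that is not copied\n",
]

header = header_from_dict(example_dict)
```

With a real file, the same function could be fed the lines of an open file handle instead of the example list.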
The body contains a column header line starting with a hash (#), followed by lines that annotate SNPs and blocks in high LD.
SNPs listed with the same ANCHOR_SNP are in the same haplotype block. If the MAFs (minor allele frequencies) within a block disagree, the tool takes the MAF of the first SNP, i.e. the one with the smallest genomic position, as the MAF of the block.
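The block MAF rule can be sketched as follows. This is a minimal illustration and not Picard's implementation: the tuple layout, the function name, and the convention that an empty anchor field marks a SNP as the anchor of its own block are all assumptions made for the example.

```python
from collections import defaultdict


def block_mafs(snps):
    """Return {block name: MAF of the SNP with the smallest position}.

    snps: iterable of (chrom, pos, name, maf, anchor) tuples.
    An empty anchor is assumed to mean the SNP anchors its own block.
    """
    blocks = defaultdict(list)
    for chrom, pos, name, maf, anchor in snps:
        key = anchor if anchor else name
        blocks[key].append((pos, maf))
    # The first SNP by genomic position decides the MAF of the block.
    return {key: min(members, key=lambda m: m[0])[1]
            for key, members in blocks.items()}


# Hypothetical SNPs: rs1 and rs2 share a block but report different MAFs.
snps = [
    ("chr1", 150, "rs2", 0.28, "rs1"),
    ("chr1", 100, "rs1", 0.30, ""),
    ("chr2", 500, "rs3", 0.10, ""),
]
mafs = block_mafs(snps)
```

Here rs1 sits at the smaller position, so its MAF stands for the whole block regardless of the order in which the rows are listed.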
Starting with Picard version v2.10.1 (released 2017/7/11), tools will recognize a VCF format if the file extension ends in .vcf, .vcf.gz or .bcf. Tools will interpret all other file extensions as the original text-based format we describe above.
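The extension rule can be sketched as below. This only mirrors the behavior described in the text and is not Picard code; the function name is hypothetical, and matching the extension case-sensitively is an assumption.

```python
VCF_EXTENSIONS = (".vcf", ".vcf.gz", ".bcf")


def haplotype_map_format(filename):
    """Guess the haplotype map format from the file extension (hypothetical helper)."""
    # Anything that does not end in a VCF-style extension is treated as text.
    return "vcf" if filename.endswith(VCF_EXTENSIONS) else "text"


formats = [haplotype_map_format(name)
           for name in ("map.vcf", "map.vcf.gz", "map.bcf", "map.txt", "map")]
```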
Click here to download an example file. Here is the body portion of this example file.
0/1 or 0|1 .| ) and the PS (phase set) format field annotation. Finally, the VCF specification (v4.2) defines the PS field as follows:
PS : phase set. A phase set is defined as a set of phased genotypes to which this genotype belongs. Phased genotypes for an individual that are on the same chromosome and have the same PS value are in the same phased set. A phase set specifies multi-marker haplotypes for the phased genotypes in the set. All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set. If the genotype in the GT field is unphased, the corresponding PS field is ignored. The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required). (Non-negative 32-bit Integer)
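To make the definition concrete, here is a small sketch that groups genotypes into phase sets. It is an interpretation of the quoted text rather than a parser for real VCF files: the tuple layout and function name are hypothetical, a genotype counts as phased only if its GT string contains "|", and phased genotypes without a PS value are pooled per chromosome under the key None.

```python
from collections import defaultdict


def phase_sets(genotypes):
    """Group phased genotypes by (chromosome, PS value).

    genotypes: iterable of (chrom, pos, gt, ps) tuples, with ps set to
    None when the PS subfield is absent.
    """
    sets = defaultdict(list)
    for chrom, pos, gt, ps in genotypes:
        if "|" not in gt:
            continue  # unphased genotype: its PS field is ignored
        sets[(chrom, ps)].append(pos)
    return dict(sets)


# Hypothetical calls for one individual.
calls = [
    ("chr1", 100, "0|1", 100),
    ("chr1", 150, "1|0", 100),
    ("chr1", 200, "0/1", 100),   # unphased, so it joins no set
    ("chr1", 900, "0|1", 900),
    ("chr2", 50, "0|1", None),
    ("chr2", 80, "1|0", None),
]
grouped = phase_sets(calls)
```

The first phase set uses the position of its first variant (100) as the PS identifier, following the convention the specification recommends.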