HapBlock - The Dynamic Programming Algorithms for Haplotype Block Partitioning and Tag SNPs Selection by Haplotype Data and Genotype Data
Programs
Our algorithms have been implemented in a program by C++, here are the pre-compiled executable files:
The executable file for Unix operating system (Sun OS Release 5.6 Version).
The executable file for Linux operating system.
The executable file for Windows operating system (Windows 95,95, NT, 2000,XP).
After downloading the package, unzip it and copy all files into the same file folder. A brief Help File (PDF format) for how to use this program can also be downloaded.
NEW FEATURES of the Program
Three Dynamic Programming Algorithms:
A dynamic programming algorithm to find a block partition with minimum number of tag SNPs.
A paramatric dynamic programming algorithm for block partitioning with a fixed genome coverage using the minimum number of tag SNPs.
A discrete dynamic programming algorithm for block partitioning with a fixed number of tag SNPs that can cover maximum length of genome.
New types of data can be processed:
Haplotype data.
Genotype data from unrelated individuals.
Genotype data from general pedigrees.
Three different haplotype block definitions:
The coverage of common haplotypes.
LD based block.
No historical recombination
A Set of Tag SNP Selection Options:
Common haplotype.
Haplotype diversity.
Haplotype entropy.
Haplotype determination coefficient.
LD measure r^2.
Common Haplotype with missing data.
Attention: This program is free for academic use. The commercial users should contact Dr. Sun before using this program. The part of program is produced with collaboration with Steve Qin and Jun Liu in the department of statistics at Harvard University. The souce code of program has not been provided and may be available upon request.
Copyright © 2003 The University of Southern California. All RIGHTS RESERVED.
Test Data Sets
The haplotype data is simulated by Coalesence Process with recombination implemented in the lab of Richard Hudson. The following data sets are used in our testing and can be used for exploring our program:
A haplotype data file - TestHap.dat, which contains 80 haplotypes consisting of 234 SNPs.
A genotype data file - TestGenoInd.dat, which contains 40 individuals and is generated by randomly pairing up haplotypes.
A genotype data file - TestGenoPed.dat, which contains 20 trios nad is generated from haplotype data.
A file for all the possible blocks and their number of tag SNPs - TestBT.dat, which is generated by our program.
A map file for all the SNPs - TestSNPSimuPos.dat, in which the genomic position of each SNPs is listed.
A map file for all the SNPs - TestSNPIndexPos.dat, in which the position of each SNPs is consecutively coded from 1 to the number of SNPs.
A file for pre-selected Tag SNPs - TestPreSNP.dat.
A file for the name of all SNPs - TestSNPName.dat.
Results: Blocks, Tag SNPs and Haplotype Patterns
We test our program based on aforementioned data using a number of different setting of parameters. In the following, we list the parameter file, the corresponding output files and the short explainations for those parameters and results. For the contents and format of these files and the meaning of each parameter in the parameter file, please refer to our help file (PDF format).
As for illustration, the following paramter files share many common parameters (Attention: since several options have been implenmented in the new version of program, the format parameter file is sligly different from the previous one. Please refer to the help file for details.). In all examples, the haplotypes with frequency greater than 4.99% are considered as common haplotypes. We also set the maximum number of samples, the maximum number of SNPs, the maximum length of a block as 100, 250 and 100, respevtively.
The dynamic programming algorithm is used to find a block partition with minimum number of tag SNPs on the basis of haplotype data. A set of consecutive SNPs with size one or more are defined as a block only if the percentage of common haplotypes is more than 80%. The tag SNPs are a minimum set of SNPs that can distinguish a set of common haplotypes that can acocunt for at least 80% of all unambiguous haplotypes. The program will output all intermediate results, all possible sets of tag SNPs and haplotype patterns. The tag SNPs are labeld by their names provided from file TestSNPName.dat. 100 permutations are performed and the statistics of interest are stored in a file. Please find the setting of parameters and results from the following files:
The file TestParSet-001.dat contains the setting of parameters.
The file TestBlockRes-001.dat contains the information of blocks and tag SNPs.
The file TestHapPat-001.dat contains the haplotype patterns in each block.
The file TestPermute-001.dat contains the results of permutation test.
The dynamic programming algorithm is used to find a block partition with minimum number of tag SNPs on the basis of genotype data from unrelated individuals. Here, the haplotype pairs assigned to each individual are used. A set of consecutive SNPs with one or more are defined as block only 80% of all pair-wise LD greater than 0.70. The tag SNPs are a minimum set of SNPs that can distinguish all common haplotypes within a block. The program will output all intermediate results, all possible sets of tag SNPs and haplotype patterns. The tag SNPs are labeld by their names provided from file TestSNPName.dat. No permutations are performed. Please find the setting of parameters and results from the following files:
The file TestParSet-002.dat contains the setting of parameters.
The fileTestBlock-002.dat contains the information of blocks and tag SNPs.
The file TestHapPat-002.dat contains the haplotype patterns in each block.
The dynamic programming algorithm is used to find a block partition with minimum number of tag SNPs on the basis of genotype data from pedigrees. A set of consecutive SNPs with size one or more are defined as a block only if the percentage of common haplotypes is more than 80%. The tag SNPs are a minimum set of SNPs that can distinguish account at least 90% of overall haplotype diversity A set of pre-selected tag SNPs is included in the analysis and pre-stored in the file TestPreSNP.dat , The program will output all intermediate results, all possible sets of tag SNPs and haplotype patterns. The tag SNPs are labeld by their indexes (starting from 0). No permutations are performed. Please find the setting of parameters and results from the following files:
The file TestParSet-003.dat contains the setting of parameters.
The file TestBlockRes-003.dat contains the information of blocks and tag SNPs.
The file TestHapPat-003.dat contains the haplotype patterns in each block.
The parametric dynamic programming algorithm is used to find a block partition with minimum number of tag SNPs that can cover 60% of genome on the basis of gneotype data from unrelated individuals. The haplotype frequencies estimated from the PL-EM algorithm is used. A set of consecutive SNPs with size one or more are defined as a block only if the percentage of common haplotypes is more than 80%. The tag SNPs are a minimum set of SNPs that can distinguish account at least 80% of overall haplotype entropy. The program will output all intermediate results, all possible sets of tag SNPs and haplotype patterns. The tag SNPs are labeld by their names provided from file TestSNPName.dat. The position of a SNP is defined in the file TestSNPSimuPos.dat. Please find the setting of parameters and results from the following files:
The file TestParSet-004.dat contains the setting of parameters.
The file TestBlockRes-004.dat contains the information of blocks and tag SNPs.
The file TestHapPat-004.dat contains the haplotype patterns in each block.
The two-dimension dynamic programming algorithm is used to find block partition with 40 tag SNPs that can cover maximum fraction of genome on the basis of all the possible blocks and the corresponding number of tag SNPs. The blocks and the number of tag SNPs are in the file TestBT.dat, which is extracted from another file TestBlockRes-001.dat. The position of a SNP is defined as its index and stored in a file. Please find the setting of parameters and results from the following files:
The file TestParSet-005.dat contains the setting of parameters.
The file TestBlockRes-005.dat contains the information of blocks and tag SNPs.
References
Clayton D (2001) http://www.nature.com/ng/journal/v29/n2/extref/ng1001-233-S10.pdf .
Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D (2002) The structure of haplotype blocks in the human genome. Science 296: 2225-2229.
Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, Hacker CR, Kautzer CR, Lee D H, Marjoribanks C, McDonough DP, et al. (2001) Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294: 1719-1723.
Qin Z, Nu T, Liu J (2002) Partitioning-Ligation-Expectation-Maximization Algorithm for haplotype inference with single-nucleotide Ploymorphisms. Am J Hum Genet 71: 1242-1247.
Wang N, Akey JM, Zhang K, Chakraborty K and Jin L (2002) Distribution of recombination crossovers and the origin of haplotype blocks: The interplay of population history, recombination, and mutation. Am J Hum Genet 71: 1227-1234.
Zhang K, Deng M, Chen T, Waterman MS and Sun F (2002) A dynamic programming algorithm for haplotype partitioning. Proc Natl Acad Sci USA 99: 7335-7339.
Zhang K, Qin Z, Liu JS, Chen T, Waterman MS, Sun F (2003) Haplotype Block Partitioning and Tag SNP Selection Using Genotype Data and Their Applications to Association Studies . Genome Res 14:908-916.
Old Programs
If you would like to use the old program (only deal with haplotye data) and browse the supplemental materials of our paper in Proc. Natl. Acad. Sci. USA 99: 7335-7339 , please go to here.
If you would like to use another old program that can deal with genotype data from unrelated individuals but only has a few options for block pratitioning and tag SNP selection, please go to here.
We are planning to incorporate more methods in our program. You are welcome to provide new methods that you want us to implement into this program. We greatly appreciate if you could point out any bugs when you use our program. Our contact information is:
Kui Zhang, Ph.D.
Department of Mathematical Sciences
Michigan Technological University
1400 Townsend Drive
Houghton, Michigan 49931
Phone: 906-487-2918
Fax: 906-487-3133
Email: kuiz@mtu.edu
or
Fengzhu Sun, Ph.D.
Program in Molecular and Computational Biology
Department of Biological Sciences
University of Southern California
Los Angeles, CA, 90089
(213) 740-2413 (phone)
(213) 740-8631 (fax)
Email: fsun@hto.usc.edu
Homepage: http://www-rcf.usc.edu/~fsun/
Created Date: March 20, 2003
Last Updated Date: Sep 17, 2015