

hapLOH is software that combines haplotype estimates and SNP array data to identify somatic segmental copy number and copy-neutral mutations.  Common applications include detection of tumor-associated mutations from low-purity tumor-normal mixture samples, and discovery of clonal aberrations in non-malignant tissues.

To obtain the software, use the contact information at the end of the page. 


hapLOH is a binary executable that depends on Perl and Python interpreters. hapLOH requires
    -Python 2.7 (incompatible with Python 3.x)
    -Perl 5.8

Releases are available for MacOS and Linux.  The software is distributed as a compressed tarball. To install hapLOH with version number VERSION, move it to the directory in which you wish to install it and unpack it:

tar zxf hapLOH-VERSION.tgz

This will create a directory named hapLOH-VERSION. The executable file is hapLOH-VERSION/bin/haploh.

Quickstart Examples

If you don't add the haploh executable to your path, then replace haploh below with the appropriate path to the executable. 

The download includes a directory EXAMPLES/ which contains a few small example input files (BAF files and haplotype estimate files) that you can use to test the program.  This most basic command runs the localization HMM using mean event length 20 Mb and event prevalence 0.1 and writes the output files to the working directory:

haploh --baf example05.bafs --phased example05.hapguess

We recommend that you set the event length and event prevalence according to the expected characteristics of your samples; see usage notes below for guidance on how to choose parameters.  You can organize your output by specifying an output directory, which will be created if it doesn't already exist. 

haploh --baf EXAMPLES/example05.bafs --phased EXAMPLES/example05.hapguess --event_mb 10 --event_prevalence 0.05 --destdir Output_10mb_prev05 --random_seed 1234

Input Files


Basic input includes two files, one for the BAFs and one for the phased genotypes, in the formats described below.  Each file should contain data for one individual only.  Markers should be ordered by genomic position, and paired files should contain data for exactly the same markers (missing values are allowed).

BAF file
A single-line, space-delimited file containing BAFs in genomic position order. Missing values may be denoted by '?'; numerical values outside of the range [0,1] will also be considered missing.
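
As a rough illustration of the missing-value rules above, here is a minimal Python sketch that reads the contents of a BAF file into a list. The function name parse_bafs is hypothetical and is not part of hapLOH; treating unparseable tokens as missing is also an assumption.

```python
def parse_bafs(text):
    """Parse single-line, space-delimited BAFs into a list of floats.

    '?' and any value outside [0, 1] are returned as None (missing).
    """
    values = []
    for token in text.split():
        if token == "?":
            values.append(None)
            continue
        try:
            x = float(token)
        except ValueError:
            # Assumption: tokens that are not numbers are treated as missing.
            values.append(None)
            continue
        # Values outside [0, 1] are considered missing, as described above.
        values.append(x if 0.0 <= x <= 1.0 else None)
    return values


bafs = parse_bafs("0.48 ? 1.02 0.0 0.97 -0.1")
print(bafs)
```

Counting the None entries gives a quick check of how many markers hapLOH is likely to treat as missing in the BAF input.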

statistical haplotype file
A two-line or two-column, space-delimited file with rows (or columns) corresponding to haplotypes. Alleles should be coded as A/B and missing values should be denoted by '?'.
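
The two accepted layouts can be read with a short Python sketch like the one below. This is only an illustration under stated assumptions: parse_haplotypes is a hypothetical helper, not part of hapLOH, and a file with exactly two rows is assumed to be in the two-line layout (a 2 x 2 file is ambiguous).

```python
VALID_CODES = {"A", "B", "?"}


def parse_haplotypes(text):
    """Return (hap1, hap2) from a two-line or two-column haplotype file."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if len(rows) == 2:
        # Two-line layout: each row is one haplotype.
        # Assumption: a file with exactly two rows is read this way.
        hap1, hap2 = rows
    elif rows and all(len(r) == 2 for r in rows):
        # Two-column layout: each column is one haplotype.
        hap1 = [r[0] for r in rows]
        hap2 = [r[1] for r in rows]
    else:
        raise ValueError("expected two lines or two columns")
    if len(hap1) != len(hap2):
        raise ValueError("the two haplotypes differ in length")
    bad = set(hap1 + hap2) - VALID_CODES
    if bad:
        raise ValueError("unexpected allele codes: %s" % sorted(bad))
    return hap1, hap2


hap1, hap2 = parse_haplotypes("A B ? A\nB A ? A\n")
print(len(hap1))  # number of markers; should equal the number of BAFs
```

Comparing len(hap1) with the number of values in the paired BAF file is a simple way to confirm that both files cover the same markers.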


To estimate the overrepresented haplotype, in addition to the hapLOH inputs above, you will need to supply a file describing the switch rates for the phase estimates.  Note that this file is required if you use the --hapid flag. 

switch rate file
(1) A file with a single value for the average switch accuracy of the statistical haplotype estimates, or
(2) A one-line, space-delimited file representing interval-specific switch probabilities, where an interval is the interval between consecutive informative markers or the interval leading to the first informative marker (for unordered haplotypes, use 0.5 as the value for the leading interval). The number of switch probabilities should equal the number of informative markers.
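
For format (2), the following Python sketch shows one way to assemble the single line, assuming you already have a switch probability for each interval between consecutive informative markers. The function name switch_rate_line is hypothetical; the leading value of 0.5 follows the note above for unordered haplotypes.

```python
def switch_rate_line(interval_probs, leading=0.5):
    """Build the one-line, space-delimited switch probability record.

    interval_probs holds one probability per interval between consecutive
    informative markers (n - 1 values for n informative markers). The
    leading-interval value is prepended so that the line has n values.
    """
    probs = [leading] + list(interval_probs)
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("switch probabilities must lie in [0, 1]")
    # %g keeps the line compact but limits output to 6 significant digits.
    return " ".join("%g" % p for p in probs)


line = switch_rate_line([0.01, 0.02, 0.015])
print(line)  # four values, for four informative markers
```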

Command Line Options


 --baf filename
 file containing BAFs in format described above
 --phased filename
 file containing statistical haplotype estimates in format described above


 --event_mb
 expected genomic size (in megabases) of imbalance events [20]
 --event_prevalence [0,1]
 expected fraction of genome that is imbalanced; used to determine HMM parameters, but does not strictly constrain the results, especially when --tpm estimate is used
 --num_event_states integer >=1
 the number of imbalance states for the localization HMM [2]
 --destdir directory name
 directory into which output files will be written; will be created if it doesn't already exist [.]
 --logfile filename
 file to which log will be written
 --help no arg.
 prints list of command line options
 --random_seed integer
 numeric seed used for random number generation [clocktime, printed to log file]


 --tpm (fixed, estimate)
 specifies whether to use fixed values for transition probability matrix, or to estimate them [estimate]
 --genome_mb
 genomic size (in megabases) of region covered by marker set; used in conjunction with event_mb, event_prevalence, and observed informative marker count to calculate transition parameters [3156 (estimate based on hg19 genome, appropriate for whole-genome arrays)]
 --event_alpha_range comma-separated list of length 2
 range that is used to determine grid of start values for alpha estimation.  If not specified, this range is calculated internally as [genomewide average phase concordance, 0.95]
 --num_starts integer
 number of starts for parameter estimation [2]
 --mean_informative_marker_count >0
 using this flag overrides event_mb and genome_mb; use this to specify event length in terms of number of informative markers
 --max_iterations integer
 maximum number of iterations per EM start [30]
 --gamma integer
 weight parameter for pseudocounts used to stabilize TPM estimation [1000]
 --initial_alphas comma-separated list of length (num_event_states+1)
 Alpha values for each hidden state in localization HMM; if specified, EM is not performed and no parameters are estimated; posterior probabilities will be calculated given the specified parameters
 --no-localization no arg. 
 turns off the localization HMM (just outputs .switch_enumeration)
 --hapid no arg.
 turns on HMM for estimating overrepresented haplotype
 --datadir directory name
 directory where intermediate files may be written or found [DESTDIR/intermediates/]

Output Files

If the --baf option is used, the prefix of the BAF input file is used as the prefix for the output files (e.g. test.baf -> test.switch_enumeration).
Note:  A directory intermediates/ will be created containing symbolic links to the input files.  This is simply to accommodate the current implementation, and the directory may be deleted after running.

basic output                                    
 .summary gives the number of sites with missing data, the number of informative sites, and some other summary information
 .informative 0-based indices of the informative sites that were used to determine phase concordance; useful for mapping results back to genomic regions
 .baf_phased_haplotypes header indicates number of individuals processed and total number of markers; first and second lines give over- and underrepresented haplotypes as determined by BAF thresholding
 .switch_enumeration phase concordance indicator at each interval (0=switch, 1=concordance)
 output from localization HMM 
 .postprobs output of localization HMM; conditional probability for each hidden state at each interval
 .finalparams final emission probability and transition probability values for the localization HMM
 .EM_log parameter value estimates at each EM iteration; star indicates that emission probability estimates reached convergence
 output from ordering HMM      
 .excesshap_haps two-line file; first and second lines give over- and underrepresented haplotypes determined by HMM

Basic Usage Notes

There are three basic applications for hapLOH: localization, testing a specific region of interest, and estimation of the overrepresented haplotype.


Localization is the most common use of hapLOH and is the default procedure.  Two important options are --event_mb and --event_prevalence.  Although they have default values and therefore are not required, we suggest that the user consider specifying them according to the expected size of the events of interest and the characteristics of the sample.  These values will be used to determine the transition probabilities.  A few guidelines when choosing parameters:
  • when the TPM is estimated (the default behavior), the values are much less influential than when using a fixed TPM.  In the estimate mode, the user may also adjust the gamma value, which may be interpreted as a weight on the expected event size.  The lower the weight, the more the observed data affect the transition parameters.
  • erring on the side of lower prevalence tends to produce qualitatively the same results as using a slightly higher prevalence, with less background noise.
  • since the transition probabilities are constant across sites, the event size distribution will be geometric with parameter (1/expected event size).  So any event size will allow the detection of larger and smaller events to some degree.  Choosing a very small value for --event_mb may produce noisy results. 
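
To make the last guideline concrete, the Python sketch below evaluates a geometric distribution with parameter 1/(expected event size). It assumes that event length is counted in sites (the source only says that transition probabilities are constant across sites) and it is not hapLOH's internal code; the function names are hypothetical.

```python
def geometric_pmf(mean_length, k):
    """P(event length == k), k = 1, 2, ..., for parameter p = 1/mean_length."""
    p = 1.0 / mean_length
    return (1.0 - p) ** (k - 1) * p


def geometric_tail(mean_length, k):
    """P(event length >= k) under the same geometric distribution."""
    p = 1.0 / mean_length
    return (1.0 - p) ** (k - 1)


# With an expected length of 100 sites, events much shorter or much longer
# than 100 sites still have non-negligible probability.
for k in (10, 100, 300):
    print("P(length >= %d) = %.3f" % (k, geometric_tail(100, k)))
```

The slowly decaying tail is why a single expected event size still allows both smaller and larger events to be detected to some degree.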


hapLOH currently does not include an option for assessing the evidence of allelic imbalance in a specific region, but you can do it yourself using a few of the output files.  To perform detection (i.e. testing a specific region for deviation from the null phase concordance rate), you will need to select values from the .switch_enumeration file that correspond to your region of interest.  First determine which markers in your dataset are located in the region of interest.  Then use the .informative file to determine which of those markers are informative (note that indices are 0-based).  Since the values in the .switch_enumeration file correspond to consecutive pairs of informative markers, there will be one fewer value than the number of informative markers.  Drop the last informative marker in the region of interest and select the values corresponding to the remaining informative markers; the average of these will be the observed phase concordance rate for the region.
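
The selection steps described above can be sketched in Python as follows. This is a minimal sketch rather than a hapLOH feature. It assumes that all markers lie on one chromosome, that you have already loaded marker positions, the .informative indices, and the .switch_enumeration values into lists, and that the j-th value describes the interval starting at the j-th informative marker. The function name region_concordance is hypothetical.

```python
def region_concordance(positions, informative, switch_enum, start, end):
    """Observed phase concordance rate for the region [start, end].

    positions:   genomic position of every marker, in file order
    informative: 0-based marker indices from the .informative file
    switch_enum: 0/1 values from .switch_enumeration (one per interval)
    Returns (rate, number of intervals), or None if fewer than two
    informative markers fall in the region.
    """
    if len(switch_enum) != len(informative) - 1:
        raise ValueError("expected one fewer value than informative markers")
    # Rank of each in-region informative marker within the informative list.
    ranks = [r for r, idx in enumerate(informative)
             if start <= positions[idx] <= end]
    if len(ranks) < 2:
        return None
    # Drop the last informative marker; keep values for the remaining ones.
    values = [switch_enum[r] for r in ranks[:-1]]
    return sum(values) / float(len(values)), len(values)


positions = [100, 200, 300, 400, 500, 600, 700, 800]
informative = [0, 2, 3, 5, 6, 7]
switch_enum = [1, 0, 1, 1, 0]
print(region_concordance(positions, informative, switch_enum, 250, 750))
```

The returned interval count is useful if you go on to compare the observed rate with the null phase concordance rate, since it is the sample size for that comparison.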

 The localization HMM is run by default, but if you are only interested in detection you can turn it off with the command line flag --no-localization.

Estimation of the Overrepresented Haplotype

To apply the HMM for estimating the over- and underrepresented haplotypes, invoke hapLOH with the flag --hapid.  The algorithm requires output from the localization HMM, so requesting --no-localization with --hapid currently produces a warning and quits without running.

This procedure produces ordered haplotypes covering all of the markers in the dataset, but note that order is only meaningful when imbalance exists.


We have a set of working scripts (mostly in R, some in Perl) for several of the common next steps in summarizing and making inferences from hapLOH output.  Here's a partial list of procedures for which we have written scripts.

  1. Apply a threshold to posterior probabilities to call discrete event regions
  2. Draw plots of BAF, LRR, and hapLOH results for specific regions or for entire samples
  3. Calculate median BAF and LRR deviations for specified regions
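
As an illustration of the first procedure in the list, here is a minimal Python sketch of thresholding. It is not one of the R or Perl scripts mentioned above. It assumes that you have already reduced the .postprobs output to one probability per interval of being in any event state, and the default threshold of 0.5 is an arbitrary choice; the function name call_regions is hypothetical.

```python
def call_regions(event_probs, threshold=0.5):
    """Return (first, last) interval indices of runs with prob >= threshold.

    event_probs: per-interval posterior probability of being in an event
    state (assumed to be precomputed from the .postprobs output).
    Indices are 0-based and inclusive.
    """
    regions = []
    start = None
    for i, p in enumerate(event_probs):
        if p >= threshold:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            regions.append((start, i - 1))  # the run ended at the previous index
            start = None
    if start is not None:
        # The final run extends to the last interval.
        regions.append((start, len(event_probs) - 1))
    return regions


probs = [0.01, 0.2, 0.8, 0.95, 0.9, 0.3, 0.05, 0.7, 0.99]
print(call_regions(probs))
```

The interval indices refer to informative-marker intervals, so the .informative file is needed to map called regions back to genomic coordinates.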

Advanced Usage Notes

You might find the advanced options useful for testing specific aspects of the method or for specialized cases in which you want to control the HMM parameters.  See the table of available options above.  


  • Is there a "paired" mode?

hapLOH relies on the genotype calls from the sample being representative of the germline genotypes.  In samples with tumor purity higher than about 25%, the genotype calls in imbalanced regions may be no-calls or may be called as homozygous; such markers are uninformative, and hapLOH will not recognize the region as imbalanced.  In this case, using the genotype calls from a paired normal sample, if available, will restore the informative genotypes.  There is no special "paired" mode; simply specify a file containing haplotype estimates made from the normal genotypes as the --phased file, and use the tumor sample BAFs as the --baf file.

  • Does hapLOH work on Affy and Illumina data?

Yes.  To apply hapLOH to Affy data, you'll need to generate B allele frequencies, which you can do from .CEL files using the Affymetrix Power Tools (APT) and PennCNV software.  The PennCNV website has a nice step-by-step guide for downloading and setting up APT and PennCNV and using them to convert .CEL files into genotype calls, BAFs, and LRRs.

Contact and Reference

If you have any questions or comments, please contact Selina at .

hapLOH is an implementation of the method described in
Vattathil, Selina, and Paul Scheet. "Haplotype-based profiling of subtle allelic imbalance with SNP arrays." Genome Research 23.1 (2013): 152-158.