hapLOH is software for combining haplotype estimates and SNP array data for identifying somatic segmental copy number and copy-neutral mutations. Common applications include detection of tumor-associated mutations from low-purity tumor-normal mixture samples, and discovery of clonal aberrations in non-malignant tissues.
To obtain the software, use the contact information at the end of the page.
hapLOH is a binary executable that depends on Perl and
Python interpreters. hapLOH requires:
- Python 2.7 (incompatible with Python 3.x)
Releases are available for MacOS and Linux. The
software is distributed as a compressed tarball. To install hapLOH with version
number VERSION, move it to the directory in which you wish to install it and unpack it:
tar zxf hapLOH-VERSION.tgz
This will create a directory named
hapLOH-VERSION. The executable file is
If you don't add the haploh executable to your path, then replace
haploh below with the appropriate path to the executable.
The download includes a directory
EXAMPLES/ which contains a few small example input files (BAF files and haplotype estimate files) that you can use to test the program. This most basic command runs the localization HMM using mean event length 20 Mb and event prevalence 0.1 and writes the output files to the working directory:
haploh --baf example05.bafs --phased example05.hapguess
We recommend that you set the event length and event prevalence according to the expected characteristics of your samples; see usage notes below for guidance on how to choose parameters. You can organize your output by specifying an output directory, which will be created if it doesn't already exist.
haploh --baf EXAMPLES/example05.bafs --phased EXAMPLES/example05.hapguess --event_mb 10 --event_prevalence 0.05 --destdir Output_10mb_prev05 --random_seed 1234
Basic input includes two files, one for the BAFs and one for the phased genotypes, in the formats described below. Each file should
contain data for one individual only.
Markers should be ordered by genomic position, and paired files should
contain data for exactly the same markers (missing values are allowed).
BAF file
A single-line, space-delimited file containing BAFs in genomic position order. Missing values may be denoted by '?'; numerical values outside of the range [0,1] will also be considered missing.
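The missing-value rules above can be checked before running hapLOH. The following is a minimal sketch, not part of hapLOH: the function name `parse_bafs` is hypothetical, and treating unparseable tokens as missing is an assumption that goes beyond the format description.

```python
def parse_bafs(line):
    """Parse one space-delimited line of BAFs; missing values become None."""
    values = []
    for token in line.split():
        if token == "?":
            # '?' is the documented missing-value marker.
            values.append(None)
            continue
        try:
            baf = float(token)
        except ValueError:
            # Assumption: any other non-numeric token is treated as missing.
            values.append(None)
            continue
        # Numerical values outside [0, 1] are considered missing.
        values.append(baf if 0.0 <= baf <= 1.0 else None)
    return values


print(parse_bafs("0.51 ? 0.02 1.3 0.98 -0.1"))
# prints [0.51, None, 0.02, None, 0.98, None]
```

The length of the returned list gives the marker count, which should match the marker count in the paired haplotype file.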
statistical haplotype file
A two-line or two-column, space-delimited file with rows (or columns) corresponding to haplotypes. Alleles should be coded as A/B and missing values should be denoted by '?'.
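A quick way to sanity-check a haplotype file against this description is sketched below. This is an illustration under stated assumptions, not hapLOH's own parser: `parse_haplotypes` is a hypothetical name, and the orientation rule (exactly two non-empty lines means rows are haplotypes) is an assumption that is ambiguous for a two-marker, two-column file.

```python
def parse_haplotypes(text):
    """Return (hap1, hap2) as lists of 'A', 'B', or '?' from file contents."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("empty haplotype file")
    if len(rows) == 2:
        # Assumption: two non-empty lines means one haplotype per row.
        hap1, hap2 = rows
    elif all(len(row) == 2 for row in rows):
        # Otherwise expect one marker per row, one haplotype per column.
        hap1 = [row[0] for row in rows]
        hap2 = [row[1] for row in rows]
    else:
        raise ValueError("expected a two-line or two-column file")
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes cover different numbers of markers")
    for allele in hap1 + hap2:
        if allele not in ("A", "B", "?"):
            raise ValueError("unexpected allele code: %r" % allele)
    return hap1, hap2


hap1, hap2 = parse_haplotypes("A B ? A\nB B A ?\n")
print(len(hap1))  # number of markers; should equal the number of BAF values
```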
To estimate the overrepresented haplotype, in addition to the hapLOH inputs
above, you will need to supply a file describing the switch rates for the phase
estimates. Note that this file is required if you use the
switch rate file
(1) A file with a single value for the average switch accuracy of the statistical haplotype estimates, or
(2) A one-line, space-delimited file representing interval-specific switch probabilities, where an interval is the interval between consecutive informative markers or the interval leading to the first informative markers (for unordered haplotypes, use 0.5 as the value for the leading interval). The number of switch probabilities should equal the number of informative markers.
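As an illustration of format (2), the sketch below builds the single line from per-interval switch probabilities. It is a minimal example under assumptions: `switch_rate_line` is a hypothetical helper, the `%g` number formatting is a guess at an acceptable representation, and the input is assumed to hold one probability per interval between consecutive informative markers, so that adding the leading interval yields one value per informative marker.

```python
def switch_rate_line(between_marker_probs, leading=0.5):
    """Build the one-line, space-delimited switch probability record.

    between_marker_probs: one probability per interval between consecutive
    informative markers (n - 1 values for n informative markers).
    leading: probability for the interval leading to the first informative
    marker; 0.5 is the documented value for unordered haplotypes.
    """
    probs = [leading] + list(between_marker_probs)
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("switch probability out of range: %r" % p)
    # Assumption: plain decimal formatting is acceptable to hapLOH.
    return " ".join("%g" % p for p in probs)


# Three informative markers -> two between-marker intervals -> three values.
print(switch_rate_line([0.01, 0.02]))
# prints 0.5 0.01 0.02
```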
Command Line Options
| FLAG|| ARGUMENT RANGE|| DESCRIPTION|
| filename||file containing BAFs in format described above|
| filename||file containing statistical haplotype estimates in format described above|
| FLAG|| ARGUMENT RANGE|| DESCRIPTION|
| ||expected genomic size (in megabases) of imbalance events |
| [0,1]||expected fraction of genome that is imbalanced; used to determine HMM parameters, but is not deterministic, especially when --tpm estimate is used|
| integer >=1|| the number of imbalance states for the localization HMM |
| directory name|| directory into which output files will be written; will be created if it doesn't already exist [.]|
| filename|| file to which log will be written|
| no arg.|| prints list of command line options|
| integer|| numeric seed used for random number generation [clocktime, printed to log file]|
|FLAG|| ARGUMENT RANGE|| DESCRIPTION|
| (fixed, estimate)|| specifies whether to use fixed values for transition probability matrix, or to estimate them [estimate]|
|>0|| genomic size of region covered by marker set; used in conjunction with event_prevalence and the observed informative marker count to calculate transition parameters [3156 (estimate based on hg19 genome, appropriate for whole-genome arrays)]|
|comma-separated list of length 2 || range that is used to determine grid of start values for alpha estimation. If not specified, this range is calculated internally as [genomewide average phase concordance, 0.95]|
| integer|| number of starts for parameter estimation |
| >0||using this flag overrides genome_mb; use this to specify event length in terms of number of informative markers|
| integer|| maximum number of iterations per EM start |
| integer|| weight parameter for pseudocounts used to stabilize TPM estimation |
|comma-separated list of length (|
| Alpha values for each hidden state in localization HMM; if specified, EM is not performed and no parameters are estimated; posterior probabilities will be calculated given the specified parameters|
| no arg.|| |
| turns off the localization HMM (just outputs |
| no arg.|| turns on HMM for estimating overrepresented haplotype|
| directory name|| directory where intermediate files may be written or found [DESTDIR/intermediates/]|
If the --baf option is used, the prefix of the BAF input file is used as the prefix for the output files (e.g. test.baf -> test.switch_enumeration).
Note: A directory
will be created containing symbolic links to the input files. This is simply to accommodate the current implementation, and the directory may be deleted after the run.
| SUFFIX|| DESCRIPTION|
|basic output || |
| gives the number of sites with missing data, the number of informative sites, and some other summary information |
| 0-based indices of the informative sites that were used to determine phase concordance; useful for mapping results back to genomic regions|
| header indicates number of individuals processed and total number of markers; first and second lines give over- and underrepresented haplotypes as determined by BAF thresholding|
|phase concordance indicator at each interval (0=switch, 1=concordance)|
| output from localization HMM || |
| output of localization HMM; conditional probability for each hidden state at each interval|
| final emission probability and transition probability values for the localization HMM|
| parameter value estimates at each EM iteration; star indicates that emission probability estimates reached convergence|
| output from ordering HMM || |
| two-line file; first and second lines give over- and underrepresented haplotypes determined by HMM|
There are three basic applications for hapLOH --- localization, testing a specific region of interest, and estimation of the over-represented haplotype.
This is the most common use of hapLOH and is the default procedure. Two important options are the
options. Although they have default values and therefore are not
required, we suggest the user consider specifying these according to the
expected size of the events of interest and the characteristics of the
sample. These values will be used to determine the transition
probabilities. A few guidelines when choosing parameters:
- when the TPM is estimated (the default behavior), the values are much less influential than when using a fixed TPM. In the estimate mode, the user may also adjust the
gamma value, which may be
interpreted as a weight on the expected event size. The lower the weight, the more the observed
data affect the transition parameters.
- erring on
the side of lower prevalence tends to produce qualitatively the same
results as using a slightly higher prevalence, with less background
- since the transition probabilities are constant across
sites, the event size distribution will be geometric with parameter
(1/expected event size). So any event size will allow the detection of
larger and smaller events to some degree. Choosing a very small value
for --event_mb may produce noisy results.
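To make the geometric point in the last guideline concrete, the sketch below computes how much probability a geometric model places on events at least a given size. This is an illustration only: the function name is hypothetical, the event size is assumed to be counted in HMM steps starting at 1, and hapLOH's actual transition parameters also depend on prevalence and on whether the TPM is estimated.

```python
def prob_event_at_least(k, expected_size):
    """P(event length >= k) under a geometric model with mean expected_size.

    Assumption: lengths are counted in HMM steps and start at 1, so
    P(length = k) = (1 - p)**(k - 1) * p with p = 1 / expected_size.
    """
    if k < 1 or expected_size < 1:
        raise ValueError("k and expected_size must be at least 1")
    p = 1.0 / expected_size
    return (1.0 - p) ** (k - 1)


# With an expected size of 100 steps, events both shorter and longer
# than 100 steps retain substantial probability.
for k in (10, 100, 300):
    print(k, round(prob_event_at_least(k, 100), 3))
```

Because the tail decays smoothly rather than cutting off, a single expected event size still leaves room for events several times smaller or larger.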
hapLOH currently does not include an option for assessing the evidence of allelic imbalance in a specific region, but you can do it yourself using a few of the output files. To perform detection (i.e. testing a specific region for
deviation from the null phase concordance rate), you will need to select values
from the .switch_enumeration file that correspond to your region of
interest. First determine which markers in
your dataset are located in the region of interest. Then use the .informative file to determine
which of those markers are informative
(note that indices are 0-based). Since
the values in the .switch_enumeration file correspond to every consecutive pair
of informative markers, there will be one fewer value than number of
informative markers. Drop the last informative marker in the region of interest and select
the values corresponding to the remaining informative markers -- the
average of these will be the observed phase concordance rate for the region.
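The procedure above can be sketched as follows. This is a minimal illustration, not a hapLOH feature: `region_concordance` is a hypothetical name, the inputs are assumed to have already been loaded into Python lists (the on-disk layout of the .informative and .switch_enumeration files is not assumed here), and value j of the switch enumeration is assumed to describe the interval between informative markers j and j+1.

```python
def region_concordance(informative_idx, switch_values, region_start, region_end):
    """Observed phase concordance rate for a region.

    informative_idx: ascending 0-based marker indices from the .informative file.
    switch_values: 0/1 values from the .switch_enumeration file, one per
        consecutive pair of informative markers.
    region_start, region_end: 0-based marker indices, inclusive.
    Returns None if the region holds fewer than two informative markers.
    """
    if len(switch_values) != len(informative_idx) - 1:
        raise ValueError("expected one fewer switch value than informative markers")
    # Positions (among informative markers) that fall inside the region.
    in_region = [j for j, marker in enumerate(informative_idx)
                 if region_start <= marker <= region_end]
    # Drop the last informative marker in the region of interest.
    in_region = in_region[:-1]
    if not in_region:
        return None
    selected = [switch_values[j] for j in in_region]
    # float() keeps the division exact under Python 2.7 as well.
    return sum(selected) / float(len(selected))


informative = [2, 5, 9, 14, 20, 31]
switches = [1, 0, 1, 1, 1]
print(region_concordance(informative, switches, 5, 20))
```

In this toy example the region contains informative markers 5, 9, 14, and 20; after dropping the last one, the selected values are 0, 1, 1, giving a concordance rate of 2/3.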
The localization HMM is run by default, but if you are only interested in detection you can turn it off with the command line flag
Estimation of the Overrepresented Haplotype
To apply the HMM for estimating the over- and underrepresented haplotypes, invoke hapLOH with the flag
--hapid. The algorithm requires output from the localization HMM,
--hapid currently produces
a warning and quits without running.
This procedure produces ordered haplotypes covering all of the markers in the dataset, but note that order is only
We have a set of working scripts (mostly in R, some in Perl) for several common next steps for summarizing and making inferences from hapLOH output. Here is a partial list of procedures for which we have written scripts.
- Apply a threshold to posterior probabilities to call discrete event regions
- Draw plots of BAF, LRR, and hapLOH results for specific regions or for entire samples
- Calculate median BAF and LRR deviations for specified regions
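For readers who want to try the first step themselves, the sketch below shows one simple way to turn per-interval posterior probabilities into discrete regions. It is not one of the scripts mentioned above: `call_regions` is a hypothetical name, the input is assumed to be a list holding the posterior probability of imbalance at each interval, and the 0.5 threshold is an arbitrary placeholder.

```python
def call_regions(posteriors, threshold=0.5):
    """Return (start, end) interval index pairs, inclusive, for each run
    of consecutive intervals whose posterior meets the threshold."""
    regions = []
    start = None
    for i, p in enumerate(posteriors):
        if p >= threshold:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            regions.append((start, i - 1))  # the run ended at the previous interval
            start = None
    if start is not None:
        # Close a run that extends to the final interval.
        regions.append((start, len(posteriors) - 1))
    return regions


print(call_regions([0.1, 0.8, 0.9, 0.2, 0.7], threshold=0.5))
# prints [(1, 2), (4, 4)]
```

The interval indices can then be mapped back to genomic positions through the informative-site indices in the basic output.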
Advanced Usage Notes
You might find the advanced options useful for testing specific aspects of the method or for specialized cases in which you want to control the HMM parameters. See the table of available options above.
- Is there a "paired" mode?
hapLOH relies on the genotype calls from the sample being representative
of the germline genotypes. In the case of samples with tumor purity
higher than about 25%, the genotype calls in imbalance regions may be
no-calls or may be called as homozygous and will be uninformative, and
hapLOH will not recognize the region as imbalanced. In this case, using
the genotype calls from a paired normal sample, if available, will
restore the informative genotypes. There is no special "paired" mode;
simply specify a file containing haplotype estimates made from the normal genotypes as the
--phased file, and use the tumor sample BAFs as the
- Does hapLOH work on Affy and Illumina data?
Yes. To apply hapLOH to Affy data, you'll need to generate B allele frequencies, which you can do from .CEL files using the Affymetrix Power Tools (APT) and PennCNV software. The PennCNV website has a nice step-by-step guide for downloading and setting up APT and PennCNV and using them to convert .CEL files into genotype calls, BAFs, and LRRs.
Contact and Reference
If you have any questions or comments, please contact Selina at email@example.com.
hapLOH is an implementation of the method described in
Vattathil, Selina, and Paul Scheet. "Haplotype-based profiling of subtle allelic imbalance with SNP arrays." Genome Research 23.1 (2013): 152-158.