hapLOH
Introduction
hapLOH is software that combines haplotype estimates and SNP array data to identify somatic segmental copy number and copy-neutral mutations. Common applications include detection of tumor-associated mutations from low-purity tumor-normal mixture samples, and discovery of clonal aberrations in non-malignant tissues.
To obtain the software, use the contact information at the end of the page.
Installation
hapLOH is a binary executable that depends on Perl and Python interpreters. hapLOH requires:
- Python 2.7 (incompatible with Python 3.x)
- Perl 5.8
Releases are available for macOS and Linux. The software is distributed as a compressed tarball. To install hapLOH with version number VERSION, move the tarball to the directory in which you wish to install it and unpack it:
tar zxf hapLOH-VERSION.tgz
This will create a directory named hapLOH-VERSION. The executable file is hapLOH-VERSION/bin/haploh.
Quickstart Examples
If you don't add the haploh executable to your path, then replace haploh below with the appropriate path to the executable.
The download includes a directory EXAMPLES/ which contains a few small example input files (BAF files and haplotype estimate files) that you can use to test the program. The most basic command runs the localization HMM using mean event length 20 Mb and event prevalence 0.1 and writes the output files to the working directory:
haploh --baf example05.bafs --phased example05.hapguess
We recommend that you set the event length and event prevalence according to the expected characteristics of your samples; see usage notes below for guidance on how to choose parameters. You can organize your output by specifying an output directory, which will be created if it doesn't already exist.
haploh --baf EXAMPLES/example05.bafs --phased EXAMPLES/example05.hapguess --event_mb 10 --event_prevalence 0.05 --destdir Output_10mb_prev05 --random_seed 1234
Input Files
Required
Basic input includes two files, one for the BAFs and one for the phased genotypes, in the formats described below. Each file should contain data for one individual only. Markers should be ordered by genomic position, and paired files should contain data for exactly the same markers (missing values are allowed).
BAF file
A single-line, space-delimited file containing BAFs in genomic position order. Missing values may be denoted by '?'; numerical values outside of the range [0,1] will also be considered missing.
Statistical haplotype file
A two-line or two-column, space-delimited file with rows (or columns) corresponding to haplotypes. Alleles should be coded as A/B and missing values should be denoted by '?'.
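Before running hapLOH, it can help to check that a pair of input files follows the rules above. The following is a minimal sketch, not part of hapLOH: the function names are hypothetical, and treating unparseable BAF tokens as missing is an assumption that goes beyond what the format description states.

```python
VALID_ALLELES = {"A", "B", "?"}


def parse_bafs(line):
    """Parse one space-delimited line of BAFs.

    '?' and numerical values outside [0, 1] become None (missing).
    """
    bafs = []
    for token in line.split():
        if token == "?":
            bafs.append(None)
            continue
        try:
            value = float(token)
        except ValueError:
            # Assumption: tokens that are not numbers are treated as missing.
            bafs.append(None)
            continue
        bafs.append(value if 0.0 <= value <= 1.0 else None)
    return bafs


def parse_haplotypes(text):
    """Parse a two-line or two-column haplotype file into two allele lists."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if len(rows) == 2:
        # Two-line layout: each row is one haplotype.
        # (A two-column file with exactly two markers looks identical;
        # this sketch reads it as two lines.)
        hap1, hap2 = rows
    else:
        # Two-column layout: each row is one marker.
        if any(len(row) != 2 for row in rows):
            raise ValueError("expected a two-line or two-column file")
        hap1 = [row[0] for row in rows]
        hap2 = [row[1] for row in rows]
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes have different numbers of markers")
    for allele in hap1 + hap2:
        if allele not in VALID_ALLELES:
            raise ValueError("unexpected allele code: %r" % allele)
    return hap1, hap2


def check_same_markers(bafs, hap1):
    """Paired files should describe exactly the same markers."""
    if len(bafs) != len(hap1):
        raise ValueError("BAF and haplotype files differ in marker count")
```

For example, parse_bafs("0.5 ? 1.2") treats both the '?' and the out-of-range 1.2 as missing.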
Optional
To estimate the overrepresented haplotype, in addition to the hapLOH inputs above, you will need to supply a file describing the switch rates for the phase estimates. Note that this file is required if you use the --hapid flag.
Switch rate file
Either
(1) A file with a single value for the average switch accuracy of the statistical haplotype estimates, or
(2) A one-line, space-delimited file representing interval-specific switch probabilities, where an interval is the interval between consecutive informative markers or the interval leading to the first informative marker (for unordered haplotypes, use 0.5 as the value for the leading interval). The number of switch probabilities should equal the number of informative markers.
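The two accepted layouts can be told apart by counting values. The sketch below is only an illustration of that check, not hapLOH's own parser: the function name and the returned labels are hypothetical, and the requirement that every value lie in [0, 1] is an assumption based on the values being rates or probabilities.

```python
def parse_switch_rates(text, n_informative):
    """Classify switch rate file contents as a single average or per-interval values.

    text: full contents of the switch rate file
    n_informative: number of informative markers in the dataset
    """
    values = [float(token) for token in text.split()]
    if not values:
        raise ValueError("switch rate file is empty")
    # Assumption: all values are rates or probabilities, so they lie in [0, 1].
    if any(not 0.0 <= value <= 1.0 for value in values):
        raise ValueError("switch values must lie in [0, 1]")
    if len(values) == 1:
        # Form (1): one genome-wide average.
        # (With exactly one informative marker the two forms coincide.)
        return "average", values
    if len(values) != n_informative:
        raise ValueError(
            "expected %d interval values, found %d" % (n_informative, len(values))
        )
    # Form (2): one value per interval, including the leading interval.
    return "per_interval", values
```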
Command Line Options
Required
Recommended
FLAG | ARGUMENT RANGE | DESCRIPTION
--event_mb | >0 | expected genomic size (in megabases) of imbalance events [20]
--event_prevalence | [0,1] | expected fraction of genome that is imbalanced; used to determine HMM parameters, but is not deterministic especially when --tpm estimate is used
--num_event_states | integer >=1 | the number of imbalance states for the localization HMM [2]
--destdir | directory name | directory into which output files will be written; will be created if it doesn't already exist [.]
--logfile | filename | file to which log will be written
--help | no arg. | prints list of command line options
--random_seed | integer | numeric seed used for random number generation [clocktime, printed to log file]
Advanced
FLAG | ARGUMENT RANGE | DESCRIPTION
--tpm | (fixed, estimate) | specifies whether to use fixed values for transition probability matrix, or to estimate them [estimate]
--genome_mb | >0 | genomic size of region covered by marker set; used in conjunction with event_mb, event_prevalence, and observed informative marker count to calculate transition parameters [3156 (estimate based on hg19 genome, appropriate for whole-genome arrays)]
--event_alpha_range | comma-separated list of length 2 | range that is used to determine grid of start values for alpha estimation. If not specified, this range is calculated internally as [genomewide average phase concordance, 0.95]
--num_starts | integer | number of starts for parameter estimation [2]
--mean_informative_marker_count | >0 | using this flag overrides event_mb and genome_mb; use this to specify event length in terms of number of informative markers
--max_iterations | integer | maximum number of iterations per EM start [30]
--gamma | integer | weight parameter for pseudocounts used to stabilize TPM estimation [1000]
--initial_alphas | comma-separated list of length (num_event_states+1) | Alpha values for each hidden state in localization HMM; if specified, EM is not performed and no parameters are estimated; posterior probabilities will be calculated given the specified parameters
--no-localization | no arg. | turns off the localization HMM (just outputs .switch_enumeration)
--hapid | no arg. | turns on HMM for estimating overrepresented haplotype
--datadir | directory name | directory where intermediate files may be written or found [DESTDIR/intermediates/]
Output Files
If the --baf option is used, the prefix of the BAF input file is used as the prefix for the output files (e.g. test.baf -> test.switch_enumeration).
Note: A directory intermediates/ will be created containing symbolic links to the input files. This is simply to accommodate the current implementation, and may be deleted after running.
Basic Usage Notes
There are three basic applications for hapLOH: localization, testing a specific region of interest, and estimation of the over-represented haplotype.
Localization
This is the most common use of hapLOH and is the default procedure. Two important options are the --event_mb and --event_prevalence options. Although they have default values and therefore are not required, we suggest the user consider specifying these according to the expected size of the events of interest and the characteristics of the sample. These values will be used to determine the transition probabilities. A few guidelines when choosing parameters:
- When the TPM is estimated (the default behavior), the values are much less influential than when using a fixed TPM. In the estimate mode, the user may also adjust the gamma value, which may be interpreted as a weight on the expected event size. The lower the weight, the more the observed data affect the transition parameters.
- Erring on the side of lower prevalence tends to produce qualitatively the same results as using a slightly higher prevalence, with less background noise.
- Since the transition probabilities are constant across sites, the event size distribution will be geometric with parameter (1/expected event size). So any choice of event size will still allow the detection of larger and smaller events to some degree. Choosing a very small value for --event_mb may produce noisy results.
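To see why a single expected event size still accommodates larger and smaller events, it helps to look at the geometric distribution directly. The sketch below assumes event length is counted in discrete steps such as informative markers; the unit and the function names are assumptions for illustration and are not taken from hapLOH.

```python
def geometric_pmf(length, expected_length):
    """P(event length == length) for a geometric distribution with mean expected_length."""
    p = 1.0 / expected_length  # parameter = 1 / expected event size
    return (1.0 - p) ** (length - 1) * p


def prob_at_least(length, expected_length):
    """P(event length >= length) under the same geometric distribution."""
    p = 1.0 / expected_length
    return (1.0 - p) ** (length - 1)


# Assumed unit: discrete steps (for example, informative markers).
# With an expected size of 20 steps, much shorter and much longer
# events both keep non-negligible probability.
for length in (5, 20, 80):
    print(length, prob_at_least(length, 20))
```

The tail probability decays smoothly rather than cutting off near the expected size, which is why the chosen value acts as a soft preference rather than a hard limit.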
Detection
hapLOH currently does not include an option for assessing the evidence of allelic imbalance in a specific region, but you can do it yourself using a few of the output files. To perform detection (i.e. testing a specific region for deviation from the null phase concordance rate), you will need to select values from the .switch_enumeration file that correspond to your region of interest. First determine which markers in your dataset are located in the region of interest. Then use the .informative file to determine which of those markers are informative (note that indices are 0-based). Since the values in the .switch_enumeration file correspond to consecutive pairs of informative markers, there is one fewer value than the number of informative markers. Drop the last informative marker in the region of interest and select the values corresponding to the remaining informative markers. The average of these values is the observed phase concordance rate for the region.
The localization HMM is run by default, but if you are only interested in detection you can turn it off with the command line flag --no-localization.
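The region calculation described above can be sketched as follows. This is an illustration under stated assumptions, not a hapLOH feature: the function name is hypothetical, the inputs are plain lists that you have already read from your marker map, the .informative file, and the .switch_enumeration file (their on-disk layouts are not described here), all markers are assumed to lie on one chromosome, and value r of the switch enumeration is assumed to describe the pair formed by informative markers r and r+1.

```python
def region_phase_concordance(positions, informative_idx, switch_values, start, end):
    """Average the switch enumeration values for one region.

    positions: genomic position of every marker, in file order (one chromosome)
    informative_idx: 0-based marker indices of the informative markers
    switch_values: one numeric value per consecutive pair of informative markers
    start, end: inclusive bounds of the region of interest
    Returns None when the region holds fewer than two informative markers.
    """
    if len(switch_values) != len(informative_idx) - 1:
        raise ValueError("expected one fewer value than informative markers")
    # Rank = position of a marker within the list of informative markers.
    ranks = [
        rank
        for rank, marker in enumerate(informative_idx)
        if start <= positions[marker] <= end
    ]
    if len(ranks) < 2:
        return None
    # Drop the last informative marker in the region; under the assumed
    # alignment, value r covers the pair (r, r + 1).
    selected = [switch_values[rank] for rank in ranks[:-1]]
    return sum(selected) / float(len(selected))
```

Comparing the returned rate with the null phase concordance rate is left to you, since the choice of test depends on your study design.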
Estimation of the Overrepresented Haplotype
To apply the HMM for estimating the over- and underrepresented haplotypes, invoke hapLOH with the flag --hapid. The algorithm requires output from the localization HMM, so requesting --no-localization with --hapid currently produces a warning and quits without running.
This procedure produces ordered haplotypes covering all of the markers in the dataset, but note that order is only meaningful when imbalance exists.
Post-processing
We have a set of working scripts (mostly in R, some in Perl) for several common next steps in summarizing and making inferences from hapLOH output. Here's a partial list of procedures for which we have written scripts.
- Apply a threshold to posterior probabilities to call discrete event regions
- Draw plots of BAF, LRR, and hapLOH results for specific regions or for entire samples
- Calculate median BAF and LRR deviations for specified regions
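As an illustration of the first item, thresholding can be as simple as grouping consecutive markers whose posterior probability of imbalance meets a cutoff. The sketch below is not one of the scripts mentioned above: the function name is hypothetical, the 0.5 default is an arbitrary placeholder, and the inputs are plain lists for markers on one chromosome, because the layout of the posterior probability output is not described here.

```python
def call_event_regions(positions, posteriors, threshold=0.5):
    """Group consecutive markers with posterior >= threshold into regions.

    positions: marker positions in genomic order (one chromosome)
    posteriors: posterior probability of imbalance at each marker
    threshold: placeholder cutoff; choose a value suited to your data
    Returns a list of (start, end) position pairs.
    """
    regions = []
    run_start = None
    run_end = None
    for pos, prob in zip(positions, posteriors):
        if prob >= threshold:
            if run_start is None:
                run_start = pos  # open a new region
            run_end = pos  # extend the current region
        elif run_start is not None:
            regions.append((run_start, run_end))  # close the region
            run_start = None
    if run_start is not None:
        regions.append((run_start, run_end))  # region runs to the last marker
    return regions
```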
Advanced Usage Notes
You might find the advanced options useful for testing specific aspects of the method or for specialized cases in which you want to control the HMM parameters. See the table of available options above.
FAQs
Is there a "paired" mode?
hapLOH relies on the genotype calls from the sample being representative of the germline genotypes. In samples with tumor purity higher than about 25%, the genotype calls in imbalance regions may be no-calls or may be called as homozygous and will be uninformative, and hapLOH will not recognize the region as imbalanced. In this case, using the genotype calls from a paired normal sample, if available, will restore the informative genotypes. There is no special "paired" mode; simply specify a file containing haplotype estimates made from the normal genotypes as the --phased file, and use the tumor sample BAFs as the --baf file.
Does hapLOH work on Affy and Illumina data?
Yes. To apply hapLOH to Affy data, you'll need to generate B allele frequencies, which you can do from .CEL files using the Affymetrix Power Tools (APT) and PennCNV software. The PennCNV documentation has a step-by-step guide for downloading and setting up APT and PennCNV and using them to convert .CEL files into genotype calls, BAFs, and LRRs.
Contact and Reference
If you have any questions or comments, please contact Selina at svattathil@utexas.edu.
hapLOH is an implementation of the method described in
Vattathil, Selina, and Paul Scheet. "Haplotype-based profiling of subtle allelic imbalance with SNP arrays." Genome Research 23.1 (2013): 152-158.