hapLOH

Introduction

hapLOH is software for combining haplotype estimates and SNP array data for identifying somatic segmental copy number and copy-neutral mutations.  Common applications include detection of tumor-associated mutations from low-purity tumor-normal mixture samples, and discovery of clonal aberrations in non-malignant tissues. 

To obtain the software, use the contact information at the end of the page. 

Installation

hapLOH is a binary executable that depends on Perl and Python interpreters. hapLOH requires:

    - Python 2.7 (incompatible with Python 3.x)

    - Perl 5.8

Releases are available for macOS and Linux. The software is distributed as a compressed tarball. To install hapLOH with version number VERSION, move the tarball to the directory in which you wish to install it and unpack it:

tar zxf hapLOH-VERSION.tgz

This will create a directory named hapLOH-VERSION. The executable file is hapLOH-VERSION/bin/haploh.

Quickstart Examples

If you don't add the haploh executable to your path, then replace haploh below with the appropriate path to the executable. 

The download includes a directory EXAMPLES/ which contains a few small example input files (BAF files and haplotype estimate files) that you can use to test the program. The most basic command runs the localization HMM using mean event length 20 Mb and event prevalence 0.1 and writes the output files to the working directory:

haploh --baf EXAMPLES/example05.bafs --phased EXAMPLES/example05.hapguess

We recommend that you set the event length and event prevalence according to the expected characteristics of your samples; see usage notes below for guidance on how to choose parameters.  You can organize your output by specifying an output directory, which will be created if it doesn't already exist. 

haploh --baf EXAMPLES/example05.bafs --phased EXAMPLES/example05.hapguess --event_mb 10 --event_prevalence 0.05 --destdir Output_10mb_prev05 --random_seed 1234
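If you run hapLOH on many samples, it can help to assemble the command in a script. The following is a minimal sketch in Python 3 (run separately from hapLOH, so it is independent of hapLOH's own Python 2.7 requirement). The helper name build_haploh_command is hypothetical; it uses only flags documented on this page and assumes haploh is on your path.

```python
import subprocess


def build_haploh_command(baf, phased, event_mb=None, event_prevalence=None,
                         destdir=None, random_seed=None, executable="haploh"):
    """Assemble a haploh argument list (hypothetical helper, not part of hapLOH)."""
    cmd = [executable, "--baf", baf, "--phased", phased]
    optional = [
        ("--event_mb", event_mb),
        ("--event_prevalence", event_prevalence),
        ("--destdir", destdir),
        ("--random_seed", random_seed),
    ]
    # Only pass flags that were explicitly set; hapLOH uses its defaults otherwise.
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_haploh_command(
    "EXAMPLES/example05.bafs", "EXAMPLES/example05.hapguess",
    event_mb=10, event_prevalence=0.05,
    destdir="Output_10mb_prev05", random_seed=1234)
print(" ".join(cmd))

# To actually run it (requires an installed haploh executable):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.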

Input Files

Required

Basic input includes two files, one for the BAFs and one for the phased genotypes, in the formats described below.  Each file should contain data for one individual only.  Markers should be ordered by genomic position, and paired files should contain data for exactly the same markers (missing values are allowed).

BAF file

A single-line, space-delimited file containing BAFs in genomic position order. Missing values may be denoted by '?'; numerical values outside of the range [0,1] will also be considered missing.

Statistical haplotype file

A two-line or two-column, space-delimited file with rows (or columns) corresponding to haplotypes. Alleles should be coded as A/B and missing values should be denoted by '?'.
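Before running hapLOH, you may want to check that a pair of input files follows the formats described above. The sketch below is not part of hapLOH; the function names are hypothetical, and it only encodes the rules stated on this page ('?' or out-of-range BAFs are missing, alleles are A/B/?, and both files cover the same markers). It works on file contents passed as strings.

```python
def parse_bafs(text):
    """Parse a single-line, space-delimited BAF string; None marks a missing value."""
    values = []
    for token in text.split():
        if token == "?":
            values.append(None)
            continue
        value = float(token)
        # Values outside [0, 1] are treated as missing, as described above.
        values.append(value if 0.0 <= value <= 1.0 else None)
    return values


def parse_haplotypes(text):
    """Parse a two-line or two-column haplotype string into two allele lists."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if len(rows) != 2:
        # Two-column layout: one marker per row, one haplotype per column.
        if any(len(row) != 2 for row in rows):
            raise ValueError("expected two rows or two columns of alleles")
        rows = [[row[0] for row in rows], [row[1] for row in rows]]
    hap1, hap2 = rows
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes have different lengths")
    for allele in hap1 + hap2:
        if allele not in ("A", "B", "?"):
            raise ValueError("unexpected allele code: %r" % allele)
    return hap1, hap2


def check_paired(bafs, haplotypes):
    """Raise if the BAF and haplotype data do not cover the same number of markers."""
    if len(bafs) != len(haplotypes[0]):
        raise ValueError("BAF and haplotype files have different marker counts")


bafs = parse_bafs("0.5 ? 1.2 0.0")
haps = parse_haplotypes("A B ? A\nB B A ?")
check_paired(bafs, haps)
```

A matching marker count does not prove that the markers are the same or in the same order, so this check is a sanity test only.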

Optional

To estimate the overrepresented haplotype, in addition to the hapLOH inputs above, you will need to supply a file describing the switch rates for the phase estimates.  Note that this file is required if you use the --hapid flag. 

Switch rate file

Either

(1) A file with a single value for the average switch accuracy of the statistical haplotype estimates, or

(2) A one-line, space-delimited file representing interval-specific switch probabilities, where an interval is the interval between consecutive informative markers or the interval leading to the first informative marker (for unordered haplotypes, use 0.5 as the value for the leading interval). The number of switch probabilities should equal the number of informative markers.
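As an illustration of format (2), the following sketch builds the single line of an interval-specific switch rate file. It is not part of hapLOH: the function name is hypothetical, the number of informative markers is taken as an input rather than computed, and a single constant probability is assumed for every interval after the leading one.

```python
def format_switch_rates(n_informative, interval_rate, leading_rate=0.5):
    """Return one space-delimited line with one switch probability per informative marker.

    The first value covers the interval leading to the first informative marker
    (0.5 for unordered haplotypes); the rest cover intervals between consecutive
    informative markers. A constant interval_rate is an assumption of this sketch.
    """
    if n_informative < 1:
        raise ValueError("need at least one informative marker")
    for rate in (interval_rate, leading_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("switch probabilities must lie in [0, 1]")
    rates = [leading_rate] + [interval_rate] * (n_informative - 1)
    return " ".join("%g" % rate for rate in rates)


line = format_switch_rates(4, 0.02)
print(line)
```

The resulting line has exactly as many values as informative markers, which is the count the file format calls for.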

Command Line Options

Required

Recommended

FLAG: --event_mb
ARGUMENT RANGE: >0
DESCRIPTION: expected genomic size (in megabases) of imbalance events [20]

FLAG: --event_prevalence
ARGUMENT RANGE: [0,1]
DESCRIPTION: expected fraction of genome that is imbalanced; used to determine HMM parameters, but is not deterministic especially when --tpm estimate is used

FLAG: --num_event_states
ARGUMENT RANGE: integer >=1
DESCRIPTION: the number of imbalance states for the localization HMM [2]

FLAG: --destdir
ARGUMENT RANGE: directory name
DESCRIPTION: directory into which output files will be written; will be created if it doesn't already exist [.]

FLAG: --logfile
ARGUMENT RANGE: filename
DESCRIPTION: file to which log will be written

FLAG: --help
ARGUMENT RANGE: no arg.
DESCRIPTION: prints list of command line options

FLAG: --random_seed
ARGUMENT RANGE: integer
DESCRIPTION: numeric seed used for random number generation [clocktime, printed to log file]

Advanced

FLAG: --tpm
ARGUMENT RANGE: (fixed, estimate)
DESCRIPTION: specifies whether to use fixed values for transition probability matrix, or to estimate them [estimate]

FLAG: --genome_mb
ARGUMENT RANGE: >0
DESCRIPTION: genomic size of region covered by marker set; used in conjunction with event_mb, event_prevalence, and observed informative marker count to calculate transition parameters [3156 (estimate based on hg19 genome, appropriate for whole-genome arrays)]

FLAG: --event_alpha_range
ARGUMENT RANGE: comma-separated list of length 2
DESCRIPTION: range that is used to determine grid of start values for alpha estimation. If not specified, this range is calculated internally as [genomewide average phase concordance, 0.95]

FLAG: --num_starts
ARGUMENT RANGE: integer
DESCRIPTION: number of starts for parameter estimation [2]

FLAG: --mean_informative_marker_count
ARGUMENT RANGE: >0
DESCRIPTION: using this flag overrides event_mb and genome_mb; use this to specify event length in terms of number of informative markers

FLAG: --max_iterations
ARGUMENT RANGE: integer
DESCRIPTION: maximum number of iterations per EM start [30]

FLAG: --gamma
ARGUMENT RANGE: integer
DESCRIPTION: weight parameter for pseudocounts used to stabilize TPM estimation [1000]

FLAG: --initial_alphas
ARGUMENT RANGE: comma-separated list of length (num_event_states+1)
DESCRIPTION: alpha values for each hidden state in localization HMM; if specified, EM is not performed and no parameters are estimated; posterior probabilities will be calculated given the specified parameters

FLAG: --no-localization
ARGUMENT RANGE: no arg.
DESCRIPTION: turns off the localization HMM (just outputs .switch_enumeration)

FLAG: --hapid
ARGUMENT RANGE: no arg.
DESCRIPTION: turns on HMM for estimating overrepresented haplotype

FLAG: --datadir
ARGUMENT RANGE: directory name
DESCRIPTION: directory where intermediate files may be written or found [DESTDIR/intermediates/]

Output Files

If the --baf option is used, the prefix of the BAF input file is used as the prefix for the output files (e.g., test.baf -> test.switch_enumeration).

Note: A directory intermediates/ will be created containing symbolic links to the input files. This is simply to accommodate the current implementation, and the directory may be deleted after running.

Basic Usage Notes

There are three basic applications for hapLOH: localization, testing a specific region of interest, and estimation of the over-represented haplotype.

Localization

This is the most common use of hapLOH and is the default procedure.  Two important options are the --event_mb and --event_prevalence options.  Although they have default values and therefore are not required, we suggest the user consider specifying these according to the expected size of the events of interest and the characteristics of the sample.  These values will be used to determine the transition probabilities.  A few guidelines when choosing parameters:

Detection

hapLOH currently does not include an option for assessing the evidence of allelic imbalance in a specific region, but you can do it yourself using a few of the output files. To perform detection (i.e., testing a specific region for deviation from the null phase concordance rate), you will need to select values from the .switch_enumeration file that correspond to your region of interest. First determine which markers in your dataset are located in the region of interest. Then use the .informative file to determine which of those markers are informative (note that indices are 0-based). Since the values in the .switch_enumeration file correspond to every consecutive pair of informative markers, there will be one fewer value than the number of informative markers. Drop the last informative marker in the region of interest and select the values corresponding to the remaining informative markers; the average of these will be the observed phase concordance rate for the region.

 The localization HMM is run by default, but if you are only interested in detection you can turn it off with the command line flag --no-localization.
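The selection steps described above can be sketched in Python 3 as follows. This is an illustration only, not part of hapLOH: the function name is hypothetical, and it assumes you have already read the .informative indices (sorted, 0-based) and the .switch_enumeration values into lists, since the exact layout of those files is not described here.

```python
from bisect import bisect_left, bisect_right


def regional_phase_concordance(informative, switch_values, first_marker, last_marker):
    """Average the switch_enumeration values for a region of interest.

    informative   -- sorted 0-based marker indices of the informative markers
    switch_values -- one value per consecutive pair of informative markers
    first_marker, last_marker -- inclusive 0-based marker range of the region
    """
    if len(switch_values) != len(informative) - 1:
        raise ValueError("expected one fewer switch value than informative markers")
    # informative[lo:hi] are the informative markers inside the region.
    lo = bisect_left(informative, first_marker)
    hi = bisect_right(informative, last_marker)
    if hi - lo < 2:
        raise ValueError("need at least two informative markers in the region")
    # Value i describes the pair (informative[i], informative[i + 1]), so
    # dropping the last informative marker leaves values lo .. hi - 2.
    selected = switch_values[lo:hi - 1]
    return sum(selected) / len(selected)


informative = [0, 3, 4, 7, 9, 12]
switch_values = [1, 1, 0, 1, 1]
rate = regional_phase_concordance(informative, switch_values, 3, 9)
print(rate)
```

In this toy example the region contains the informative markers 3, 4, 7, and 9, so the three values for the pairs (3,4), (4,7), and (7,9) are averaged.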

Estimation of the Overrepresented Haplotype

To apply the HMM for estimating the over- and underrepresented haplotypes, invoke hapLOH with the flag --hapid.  The algorithm requires output from the localization HMM, so requesting --no-localization with --hapid currently produces a warning and quits without running.  

This procedure produces ordered haplotypes covering all of the markers in the dataset, but note that order is only meaningful when imbalance exists.

Post-processing

We have a set of working scripts (mostly in R, some in Perl) for several of the common next steps in summarizing and making inference from hapLOH output. Here's a partial list of procedures for which we have written scripts.

Advanced Usage Notes

You might find the advanced options useful for testing specific aspects of the method or for specialized cases in which you want to control the HMM parameters.  See the table of available options above.  

FAQs

hapLOH relies on the genotype calls from the sample being representative of the germline genotypes. In samples with tumor purity higher than about 25%, the genotype calls in imbalanced regions may be no-calls or may be called as homozygous; such markers are uninformative, and hapLOH will not recognize the region as imbalanced. In this case, using the genotype calls from a paired normal sample, if available, will restore the informative genotypes. There is no special "paired" mode; simply specify a file containing haplotype estimates made from the normal genotypes as the --phased file, and use the tumor sample BAFs as the --baf file.

hapLOH can be applied to Affymetrix data. To do so, you'll need to generate B allele frequencies, which you can do from .CEL files using the Affymetrix Power Tools (APT) and PennCNV software. The PennCNV website has a step-by-step guide for downloading and setting up APT and PennCNV and using them to convert .CEL files into genotype calls, BAFs, and LRRs.

Contact and Reference

If you have any questions or comments, please contact Selina at svattathil@utexas.edu.

hapLOH is an implementation of the method described in

Vattathil, Selina, and Paul Scheet. "Haplotype-based profiling of subtle allelic imbalance with SNP arrays." Genome Research 23.1 (2013): 152-158.