hapLOHseq

Introduction

haplohseq identifies regions of allelic imbalance (AI) in sequencing data obtained from impure samples where AI events exist in a potentially low proportion of cells in the sample. Input to the software includes a VCF file of genotypes and estimated phased genotypes. The software produces a posterior probability file that includes probabilities of an allelic imbalance event for each heterozygous, polymorphic genomic site contained in the VCF file. Briefly, to estimate the probabilities of AI events, or loss-of-heterozygosity and copy-number-aberration events (LOH and CNA), these are the basic steps:

1. Genotype samples at polymorphic sites
2. Phase genotypes
3. Run haplohseq

Step 1 could be run using your choice of a genotype caller and step 2 can be done with a variety of existing phasing software and reference panels (we provide a companion phasing utility with haplohseq to do this as well - although the method is currently unpublished).

For step 1, we have been running the GATK UnifiedGenotyper for genotype calling.

For step 2, we have been performing simple phasing using a utility (simple_phaser.py that is bundled with haplohseq) that is dependent on a 1000 Genomes reference panel (which is underlying hg19.exome.ldmap which is a dependency for simple_phaser.py), but your choice of data and phasing software is up to you (see required formats in the file descriptions below).

As an alternative, a tutorial for phasing using MACH and the 1000 Genomes can be found at http://csg.sph.umich.edu/abecasis/MACH/tour and downloads for MACH and the 1000 Genomes data can be found at http://csg.sph.umich.edu/abecasis/MACH/download.

haplohseq has been developed for the detection of subtle allelic imbalance events from next-generation sequencing data. haplohseq is a sequencing-based extension of hapLOH (Vattathil and Scheet, Genome Research 2013), which is a method for the detection of subtle allelic imbalance events from SNP array data. haplohseq is capable of identifying events of 10 mega-bases or greater occurring in as little as 16% of the sample using exome sequencing data (at 80x) and 4% using whole genome se-quencing data (at 30x), far exceeding the capabilities of existing software.

Download/installation

- hapLOHseq: Download the Mac OSX or Linux version of the software from http://scheet.org/software.html. Then extract the tar.gz file and verify that the software works by trying the quick start example below. There will be a haplohseq executable in the root directory of the extraction.
- R (for results visualization): If you want to generate the plot using the utility R script plot_haplohseq.R (see quick start example), please install R and the optparse library.

Quick start example

After you download and extract haplohseq_platform.tar.gz, there is an example tumor_exome.vcf file in the example directory that you can process. You can execute example_run.sh from the example directory of the tar.gz extraction to run the complete example.

> tar -xvzf haplohseq_macosx-0.1.2.tar.gz

> cd haplohseq_macosx/example

> ./example_run.sh

This script, executes three main steps:

1. Phases the het sites in tumor_exome.vcf using a utility phasing script (simple_phaser.py) provided with the haplohseq bundle. You can instead use MACH, fastPHASE, BEAGLE or your phasing software of choice. However, there are 2 output files that need to be formatted like they are for MACH. Examples of such files are generated using our simple_phaser.py script and can be found in the example_output directory after step 1 is executed. See tumor_example.hap and tumor_exome.pos.
2. Runs haplohseq to assign AI probabilities across the genome for the test sample. The detailed report that includes this information (tumor_exome_haplohseq.posterior.dat) can be found in the example_output directory.
3. R is called to generate a plot for visualization of the haplohseq AI probabilities.

Here are the contents of example_run.sh to achieve those three steps:

#! /bin/bash

# Example:

# Identify allelic imbalance (AI) given a tumor

# exome VCF file generated using the UnifiedGenotyper

# of the GATK. This involves the following 3 steps.

printf "STEP 1: PHASING 1KG HET SITES ...\n"

python ../scripts/simple_phaser.py \

--ldmap ../ldmap/hg19.exome.ldmap \

--vcf example_input/tumor_exome.vcf \

-o example_output/tumor_exome

printf "\nSTEP 2: IDENTIFYING REGIONS OF AI ...\n"

../haplohseq \

--vcf example_output/tumor_exome.hap.vcf \

--phased example_output/tumor_exome.hap \

--event_prevalence 0.1 \

-d example_output \

-p tumor_exome_haplohseq

printf "\nSTEP 3: PLOTTING HAPLOHSEQ GENOMIC AI PROFILE ...\n"

Rscript ../scripts/haplohseq_plot.R \

--file example_output/tumor_exome_haplohseq.posterior.dat \

--out example_output \

--prefix tumor_exome_haplohseq

haplohseq parameters

> haplohseq --help

haplohseq (version 0.1.2) provides probabilities of allelic imbalance events in sequencing data.

Command-line parameters:

-v [ --version ] print version string

-h [ --help ] produce help message

--phased arg file of phased alleles

-f [ --freq ] arg file of allele frequencies - only for

microarray data (1-to-1 correspondence

with markers in --phased)

-v [ --vcf ] arg vcf file of genotype calls for sample -

inferred from sequencing data

--vcf_min_depth arg (=10) min depth for genotyped sites to be

included in HMM

--vcf_sample_name arg sample name to run haplohseq on -

needed if using a multi-sample VCF

-o [ --obs ] arg file of observed sequences used to

generate hidden most likely hidden

sequences

-d [ --dest ] arg destination directory for output files

--est_trans estimate HMM event state transition

params

--est_aberrant_emissions estimate HMM aberrant event state

emission params

--est_normal_emissions estimate HMM normal event state

emission params

-m [ --marker_density ] arg (=uniform)

strategy for modeling distances between

genotype markers (uniform, genomic_pos)

-p [ --prefix ] arg prefix for output files

-c [ --config ] arg name of HMM configuration file

-t [ --threads ] arg (=1) number of threads to use for haplohseq

--genome_mb arg (=3156) genome size in megabases (defaults to

3156MB - for a whole human genome

(hg19) size estimate) use 40MB for an

exome.

--event_prevalence arg (=0.10000000000000001)

proportion of the genome expected to be

aberrant (defaults to 0.2, or 20% -

haplohseq is robust to incorrect

estimates of event_prevalence)

--event_mb arg (=20) expected event size in megabases

(defaults to 20MB - haplohseq is robust

to incorrect estimates of event_length)

--initial_param_normal arg (=0.5) probability of emitting a 1 in the

normal state alpha_0 (defaults to 0.5)

--initial_param_event arg (=0.69999999999999996)

value for the first iteration

probability of emitting a 1 in the

event state alpha_i (defaults to 0.6)

--end_param_event arg (=0.69999999999999996)

value for the last iteration

probability of emitting a 1 in the

event state alpha_i (defaults to 0.6)

--num_param_event_starts arg (=1) number of starts to use for the

param_event starting at

initial_param_event to end_param_event

(defaults to 1 with initial_param_event

== end_param_event or to 2 if

initial_param_event != end_param_event)

--em_iterations arg (=25) number of EM iterations to fit HMM

parameters (defaults to 10)

--phase_error_estimate arg (=0.070000000000000007)

estimate for phasing error (defaults to

0.07)

--obs_type arg (=phase_concordance) observed data for the HMM:

phase_concordance or read_counts

--num_states arg (=2) number of states for HMM (defaults to

2).

File descriptions and formats

input files

The number and order of the genotypes in the phased genotypes file and the vcf file should correspond with each other.

phased genotypes

By default, haplohseq accepts the format of the MACH output file for the phased genotypes: a hap file (defining the haplotypes) and a pos file (specifying the positions for each genotype in the haplotype file). Some example files are generated in the example_output directory by running the quick start example.

See example_output/tumor_exome.hap for a full example of a hap file. Here is a snippet:

tumor_exome.hap HAP0 GACGAATTGAAGTTTCTCCTG...

tumor_exome.hap HAP1 AGGACGCCCCGACCCACGTCA...

See example_output/tumor_exome.pos for a full example of a pos file. Here is a snippet:

chr1:808631

chr1:876499

chr1:877715

chr1:881627

chr1:887560

chr1:887801

chr1:888639

chr1:888659

chr1:889158

chr1:889159

chr1:889238

chr1:894573

...

vcf file of genotypes

The VCF input file must be a single-sample VCF file. Genotype fields needed by haplohseq (these are provided by the GATK UnifiedGenotyper) include:

- GT: Genotype as 0/0, 0/1 or 1/1
- DP: Depth, or coverage, at the genotyped site
- AD: Allelic depths for the ref and alt alleles at the genotyped site

See example_input/tumor_exome.vcf for an example. This VCF was generated by using the GATK UnifiedGenotyper. For simplicity, the VCF field definitions in the comments section were removed (but they can be left in while running haplohseq).

output reports

haplohseq events

prefix.events.dat is generated by haplohseq defining events that had a >90% posterior probability of being in AI.

CHR BEGIN END EVENT_STATE NUM_INFORMATIVE_MARKERS MEAN_POSTPROB PHASE_CONCORDANCE

1 888639 248814126 S1 2512 0.998185 0.79379

3 3067850 197483279 S1 1272 0.985885 0.771226

4 960713 190906025 S1 983 0.997266 0.780264

5 143197 180486247 S1 1090 0.998114 0.845872

6 2624053 31116246 S1 362 0.909283 0.803867

6 31555130 170878886 S1 1023 0.987634 0.765396

9 286593 141071475 S1 896 0.991984 0.71875

10 875287 135438977 S1 975 0.997652 0.848205

11 214163 57684949 S1 479 0.976492 0.764092

12 347068 133463969 S1 1230 0.997673 0.791057

13 20716411 115047468 S1 400 0.986877 0.7525

15 22928537 102319267 S1 739 0.99597 0.843031

16 313406 27357927 S1 423 0.9871 0.858156

17 456563 34951458 S1 592 0.988575 0.814189

17 43507008 59544863 S1 122 0.949146 0.836066

18 688463 77918221 S1 349 0.988656 0.77937

21 28212760 48078611 S1 279 0.993926 0.845878

haplohseq posterior probabilities

prefix.posterior.dat is generated by haplohseq giving detailed information about the probability of each heterozygous site (that are also in the 1KG) being in a region of AI.

CHR POS REF ALT INDEX PHASE_CONCORDANCE_SWITCH HAP GT RAF DP S0 S1

1 808631 G A 0 0 1 0/1 0.411765 17 NA NA

1 876499 A G 1 0 1 0/1 0.533333 29 0.94806 0.0519398

1 877715 C G 2 0 1 0/1 0.357143 14 0.860915 0.139085

1 881627 G A 3 1 1 0/1 0.423529 162 0.714703 0.285297

1 887560 A C 4 1 1 0/1 0.433333 86 0.609568 0.390432

haplohseq plot

prefix.png is generated by the haplohseq_plot.R script. The green and blue dots depict the genome-wide RAF across the genome, where the color indicates a putative membership into haplotype 1 or 2. The red line represents AI posterior probabilities across the genome where the y-axis ranges from 0 to 1.

FAQ

1. How can I tune haplohseq to be more sensitive for subtle events? You can reduce the --initial_param_event and --end_param_event parameters to be closer to 0.5 (but greater than the --initial_param_normal parameter). You can also increase the --event_prevalence parameter which is 0.1 (e.g., it is expected that 10% of the genome is aberrant by default). Keep in mind that tuning these parameters to increase the sensitivity, will come at the expense of specificity.
2. Where can I learn more about the HMM underlying haplohseq? See Vattathil and Scheet, 2013 (Genome Research) for a description of the hapLOH model, which underlies haplohseq.
3. How do I cite haplohseq? The manuscript for haplohseq is currently in preparation. That citation will be added here if/when the manuscript gets published. Please also cite hapLOH (Vattathil and Scheet. Haplotype-based profiling of subtle allelic imbalance with SNP arrays. Genome Research 2013.), which describes the method upon which haplohseq is based.