hapLOHseq

Introduction

haplohseq identifies regions of allelic imbalance (AI) in sequencing data obtained from impure samples where AI events exist in a potentially low proportion of cells in the sample.  Input to the software includes a VCF file of genotypes and estimated phased genotypes.  The software produces a posterior probability file that includes probabilities of an allelic imbalance event for each heterozygous, polymorphic genomic site contained in the VCF file.  Briefly, to estimate the probabilities of AI events, or loss-of-heterozygosity and copy-number-aberration events (LOH and CNA), these are the basic steps:

Step 1 could be run using your choice of a genotype caller and step 2 can be done with a variety of existing phasing software and reference panels (we provide a companion phasing utility with haplohseq to do this as well - although the method is currently unpublished).  

For step 1, we have been running the GATK UnifiedGenotyper for genotype calling.  

For step 2, we have been performing simple phasing using a utility (simple_phaser.py that is bundled with haplohseq) that is dependent on a 1000 Genomes reference panel (which is underlying hg19.exome.ldmap which is a dependency for simple_phaser.py), but your choice of data and phasing software is up to you (see required formats in the file descriptions below).

As an alternative, a tutorial for phasing using MACH and the 1000 Genomes can be found at http://csg.sph.umich.edu/abecasis/MACH/tour and downloads for MACH and the 1000 Genomes data can be found at http://csg.sph.umich.edu/abecasis/MACH/download.

haplohseq has been developed for the detection of subtle allelic imbalance events from next-generation sequencing data.  haplohseq is a sequencing-based extension of hapLOH (Vattathil and Scheet, Genome Research 2013), which is a method for the detection of subtle allelic imbalance events from SNP array data.  haplohseq is capable of identifying events of 10 mega-bases or greater occurring in as little as 16% of the sample using exome sequencing data (at 80x) and 4% using whole genome se-quencing data (at 30x), far exceeding the capabilities of existing software. 

Download/installation

Quick start example

After you download and extract haplohseq_platform.tar.gz, there is an example tumor_exome.vcf file in the example directory that you can process.  You can execute example_run.sh from the example directory of the tar.gz extraction to run the complete example.

> tar -xvzf haplohseq_macosx-0.1.2.tar.gz

> cd haplohseq_macosx/example

> ./example_run.sh

This script, executes three main steps:

Here are the contents of example_run.sh to achieve those three steps:

#! /bin/bash

# Example:

# Identify allelic imbalance (AI) given a tumor

# exome VCF file generated using the UnifiedGenotyper

# of the GATK. This involves the following 3 steps.

printf "STEP 1: PHASING 1KG HET SITES ...\n"

python ../scripts/simple_phaser.py \

  --ldmap ../ldmap/hg19.exome.ldmap \

  --vcf example_input/tumor_exome.vcf \

  -o example_output/tumor_exome

printf "\nSTEP 2: IDENTIFYING REGIONS OF AI ...\n"

../haplohseq \

  --vcf example_output/tumor_exome.hap.vcf \

  --phased example_output/tumor_exome.hap \

  --event_prevalence 0.1 \

  -d example_output \

  -p tumor_exome_haplohseq

printf "\nSTEP 3: PLOTTING HAPLOHSEQ GENOMIC AI PROFILE ...\n"

Rscript ../scripts/haplohseq_plot.R \

  --file example_output/tumor_exome_haplohseq.posterior.dat \

  --out example_output \

  --prefix tumor_exome_haplohseq

haplohseq parameters

> haplohseq --help

haplohseq (version 0.1.2) provides probabilities of allelic imbalance events in sequencing data.

Command-line parameters:

  -v [ --version ]                      print version string

  -h [ --help ]                         produce help message

  --phased arg                          file of phased alleles

  -f [ --freq ] arg                     file of allele frequencies - only for 

                                        microarray data (1-to-1 correspondence 

                                        with markers in --phased)

  -v [ --vcf ] arg                      vcf file of genotype calls for sample -

                                        inferred from sequencing data

  --vcf_min_depth arg (=10)             min depth for genotyped sites to be 

                                        included in HMM

  --vcf_sample_name arg                 sample name to run haplohseq on - 

                                        needed if using a multi-sample VCF

  -o [ --obs ] arg                      file of observed sequences used to 

                                        generate hidden most likely hidden 

                                        sequences

  -d [ --dest ] arg                     destination directory for output files

  --est_trans                           estimate HMM event state transition 

                                        params

  --est_aberrant_emissions              estimate HMM aberrant event state 

                                        emission params

  --est_normal_emissions                estimate HMM normal event state 

                                        emission params

  -m [ --marker_density ] arg (=uniform)

                                        strategy for modeling distances between

                                        genotype markers (uniform, genomic_pos)

  -p [ --prefix ] arg                   prefix for output files

  -c [ --config ] arg                   name of HMM configuration file

  -t [ --threads ] arg (=1)             number of threads to use for haplohseq

  --genome_mb arg (=3156)               genome size in megabases (defaults to 

                                        3156MB - for a whole human genome 

                                        (hg19) size estimate) use 40MB for an 

                                        exome.

  --event_prevalence arg (=0.10000000000000001)

                                        proportion of the genome expected to be

                                        aberrant (defaults to 0.2, or 20% - 

                                        haplohseq is robust to incorrect 

                                        estimates of event_prevalence)

  --event_mb arg (=20)                  expected event size in megabases 

                                        (defaults to 20MB - haplohseq is robust

                                        to incorrect estimates of event_length)

  --initial_param_normal arg (=0.5)     probability of emitting a 1 in the 

                                        normal state alpha_0 (defaults to 0.5)

  --initial_param_event arg (=0.69999999999999996)

                                        value for the first iteration 

                                        probability of emitting a 1 in the 

                                        event state alpha_i (defaults to 0.6)

  --end_param_event arg (=0.69999999999999996)

                                        value for the last iteration 

                                        probability of emitting a 1 in the 

                                        event state alpha_i (defaults to 0.6)

  --num_param_event_starts arg (=1)     number of starts to use for the 

                                        param_event starting at 

                                        initial_param_event to end_param_event 

                                        (defaults to 1 with initial_param_event

                                        == end_param_event or to 2 if 

                                        initial_param_event != end_param_event)

  --em_iterations arg (=25)             number of EM iterations to fit HMM 

                                        parameters (defaults to 10)

  --phase_error_estimate arg (=0.070000000000000007)

                                        estimate for phasing error (defaults to

                                        0.07)

  --obs_type arg (=phase_concordance)   observed data for the HMM: 

                                        phase_concordance or read_counts

  --num_states arg (=2)                 number of states for HMM (defaults to 

                                        2).

File descriptions and formats

input files

The number and order of the genotypes in the phased genotypes file and the vcf file should correspond with each other. 

phased genotypes

By default, haplohseq accepts the format of the MACH output file for the phased genotypes: a hap file (defining the haplotypes) and a pos file (specifying the positions for each genotype in the haplotype file).  Some example files are generated in the example_output directory by running the quick start example. 

See example_output/tumor_exome.hap for a full example of a hap file.  Here is a snippet:

tumor_exome.hap HAP0 GACGAATTGAAGTTTCTCCTG...

tumor_exome.hap HAP1 AGGACGCCCCGACCCACGTCA...

See example_output/tumor_exome.pos for a full example of a pos file.  Here is a snippet:

chr1:808631

chr1:876499

chr1:877715

chr1:881627

chr1:887560

chr1:887801

chr1:888639

chr1:888659

chr1:889158

chr1:889159

chr1:889238

chr1:894573

...

vcf file of genotypes

The VCF input file must be a single-sample VCF file.  Genotype fields needed by haplohseq (these are provided by the GATK UnifiedGenotyper) include:

See example_input/tumor_exome.vcf for an example.  This VCF was generated by using the GATK UnifiedGenotyper.  For simplicity, the VCF field definitions in the comments section were removed (but they can be left in while running haplohseq).

output reports

haplohseq events

prefix.events.dat is generated by haplohseq defining events that had a >90% posterior probability of being in AI.

CHR BEGIN END EVENT_STATE NUM_INFORMATIVE_MARKERS MEAN_POSTPROB PHASE_CONCORDANCE

1 888639 248814126 S1 2512 0.998185 0.79379

3 3067850 197483279 S1 1272 0.985885 0.771226

4 960713 190906025 S1 983 0.997266 0.780264

5 143197 180486247 S1 1090 0.998114 0.845872

6 2624053 31116246 S1 362 0.909283 0.803867

6 31555130 170878886 S1 1023 0.987634 0.765396

9 286593 141071475 S1 896 0.991984 0.71875

10 875287 135438977 S1 975 0.997652 0.848205

11 214163 57684949 S1 479 0.976492 0.764092

12 347068 133463969 S1 1230 0.997673 0.791057

13 20716411 115047468 S1 400 0.986877 0.7525

15 22928537 102319267 S1 739 0.99597 0.843031

16 313406 27357927 S1 423 0.9871 0.858156

17 456563 34951458 S1 592 0.988575 0.814189

17 43507008 59544863 S1 122 0.949146 0.836066

18 688463 77918221 S1 349 0.988656 0.77937

21 28212760 48078611 S1 279 0.993926 0.845878

haplohseq posterior probabilities

prefix.posterior.dat is generated by haplohseq giving detailed information about the probability of each heterozygous site (that are also in the 1KG) being in a region of AI.

CHR POS REF ALT INDEX PHASE_CONCORDANCE_SWITCH HAP GT RAF DP S0 S1

1 808631 G A 0 0 1 0/1 0.411765 17 NA NA

1 876499 A G 1 0 1 0/1 0.533333 29 0.94806 0.0519398

1 877715 C G 2 0 1 0/1 0.357143 14 0.860915 0.139085

1 881627 G A 3 1 1 0/1 0.423529 162 0.714703 0.285297

1 887560 A C 4 1 1 0/1 0.433333 86 0.609568 0.390432

haplohseq plot

prefix.png is generated by the haplohseq_plot.R script.  The green and blue dots depict the genome-wide RAF across the genome, where the color indicates a putative membership into haplotype 1 or 2.  The red line represents AI posterior probabilities across the genome where the y-axis ranges from 0 to 1.

FAQ