PairClone

R & C package for the paper

PairClone: A Bayesian Subclone Caller Based on Mutation Pairs

by Tianjian Zhou, Peter Mueller, Subhajit Sengupta and Yuan Ji.

[Download package] [Manuscript]

Abstract

Tumour cell populations can be thought of as a composition of heterogeneous cell subpopulations, with each subpopulation being characterized by overlapping sets of single-nucleotide variants. Such subpopulations are known as subclones and are an important target for precision medicine. Reconstructing subclones from next generation sequencing data is one of the major challenges in computational biology. We present PairClone as a new tool to implement this reconstruction. The main idea of PairClone is to model short reads mapped to pairs of proximal single nucleotide variants, which we refer to as mutation pairs. In contrast, other existing methods use only marginal reads for unpaired single-nucleotide variants. Using Bayesian non-parametric models, we estimate posterior probabilities of the number, genotypes and population frequencies of subclones in one or more tumour samples. We use the categorical Indian buffet process as a prior probability model for subclones. Column vectors of categorical matrices record the corresponding sets of mutation pairs for subclones. The performance of PairClone is assessed by using simulated and real data sets with a comparison with existing methods.

Manual

This package contains the source files to run the MCMC algorithm in the paper, together with the simulation datasets and the lung cancer dataset described in the paper. See below for a detailed description of each file.

  • PairClone_main.R: This is the main code to run MCMC on a dataset. It gives posterior estimates of C, Z, W, etc.
  • PairClone_MCMC_PT.cpp: This is the main file for the MCMC sampler. Users need to compile it to generate the "PairClone_MCMC_PT.so" file, which is required to run the code. To compile this file, run the following in the terminal:
$ R CMD SHLIB PairClone_MCMC_PT.cpp

  • PairClone_fn.R: This contains necessary functions to run the MCMC sampler in R.
  • PairClone_plot.R: This contains functions to generate the plots for Z and W.
  • simu_truth_x.RDS: Simulation truth and simulated data for Simulation x, x = 1, 2, 3, 4. Each simu_truth_x.RDS is a list containing Z, W, n and the other parameters. Here Z and W are the simulation truths, and n is the simulated data (a T*K*G array).
  • lung_data_69_pairs.RDS: A list containing n, where n is a 4*69*8 array recording the read counts for the 69 mutation pairs across the 4 samples.
  • lung_data_862_snvs.RDS: A list containing n and N, where n and N are 862*4 matrices recording the variant and total read counts for the 862 SNVs across the 4 samples.
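
As a quick sanity check on the data files listed above, the following R sketch reads the two lung cancer files and reports their dimensions. It assumes that the list elements are named n and N, as the descriptions suggest, and that the files sit in the working directory. The helper names describe_pair_counts and variant_fraction are made up for illustration and are not part of the package.

```r
# Hypothetical helpers for inspecting the data files; not part of PairClone.

# Summarise a count array laid out as samples x mutation pairs x read categories
# (the T*K*G layout described above).
describe_pair_counts <- function(n) {
  stopifnot(is.array(n), length(dim(n)) == 3)
  d <- dim(n)
  list(n_samples = d[1], n_pairs = d[2], n_categories = d[3])
}

# Element-wise fraction of variant reads, n / N; NA where the total count is zero.
variant_fraction <- function(n, N) {
  stopifnot(all(dim(n) == dim(N)))
  ifelse(N > 0, n / N, NA_real_)
}

pairs_file <- "lung_data_69_pairs.RDS"
if (file.exists(pairs_file)) {
  lung <- readRDS(pairs_file)
  str(describe_pair_counts(lung$n))  # assumes the list element is named "n"
}

snv_file <- "lung_data_862_snvs.RDS"
if (file.exists(snv_file)) {
  snv <- readRDS(snv_file)           # assumes elements named "n" and "N"
  frac <- variant_fraction(snv$n, snv$N)
  print(dim(frac))
}
```

If the files match the descriptions above, the pair-count summary should report 4 samples, 69 mutation pairs and 8 categories, and the fraction matrix should be 862 by 4.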

To run the simulation example: extract the package, go to its directory, compile "PairClone_MCMC_PT.cpp", and run the code in "PairClone_main.R" line by line in the R console.
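
Before stepping through "PairClone_main.R", it can help to confirm that the compiled library and the R source files are all in the working directory. The sketch below is one way to do that from the R console; the function name check_pairclone_setup is hypothetical, and the file list is taken from the descriptions above.

```r
# Hypothetical pre-flight check; not part of PairClone.
# Returns a named logical vector: TRUE for each required file found in `dir`.
check_pairclone_setup <- function(dir = ".") {
  needed <- c("PairClone_main.R",
              "PairClone_fn.R",
              "PairClone_plot.R",
              "PairClone_MCMC_PT.so")  # produced by R CMD SHLIB
  present <- file.exists(file.path(dir, needed))
  names(present) <- needed
  present
}

status <- check_pairclone_setup()
if (all(status)) {
  message("All required files found.")
} else {
  message("Missing: ", paste(names(status)[!status], collapse = ", "))
}
```

If "PairClone_MCMC_PT.so" is reported missing, the compile step has not been run in this directory or did not succeed.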