TreeClone

R & C package for the paper

TreeClone: Reconstruction of Tumor Subclone Phylogeny Based on Mutation Pairs using Next Generation Sequencing Data

by Tianjian Zhou, Subhajit Sengupta, Peter Mueller and Yuan Ji.

[Download package] [Manuscript]

Abstract

We present TreeClone, a latent feature allocation model to reconstruct tumor subclones subject to phylogenetic evolution that mimics tumor evolution. Similar to most current methods, we consider data from next-generation sequencing of tumor DNA. Unlike most methods that use information in short reads mapped to single nucleotide variants (SNVs), we consider subclone phylogeny reconstruction using pairs of two proximal SNVs that can be mapped by the same short reads. As part of the Bayesian inference model, we construct a phylogenetic tree prior. The use of the tree structure in the prior greatly strengthens inference. Only subclones that can be explained by a phylogenetic tree are assigned non-negligible probabilities. The proposed Bayesian framework implies posterior distributions on the number of subclones, their genotypes, cellular proportions and the phylogenetic tree spanned by the inferred subclones. The proposed method is validated against different sets of simulated and real-world data using single and multiple tumor samples.

Manual

This package contains the R and C++ source files and example data needed to run the MCMC algorithm described in the TreeClone paper.

  • TreeClone_fn.R, TreeClone_plot.R and TreeClone_simu.R: These are the R functions for running the MCMC simulation, plotting the heatmaps and generating/processing simulated data, respectively.
  • TreeClone_main.R: This is the main script. It takes the command line arguments input_file, suffix, T and K, and generates MCMC samples.
    • input_file: a .txt file that records the read counts for T samples, K mutation pairs and G short read categories. The counts form a T*K*G R array that is flattened to a vector with c().
    • suffix: the suffix to add to the result file names. The results are stored as "./results/MCMCspls_suffix.rds" and "./results/point_est_suffix.rds".
    • T: number of tissue samples.
    • K: number of mutation pairs.
  • PairTree_MCMC_PT.cpp, PairTree_MCMC_R.cpp: These are the C++ functions implementing the parallel tempering and MCMC simulation for sampling trees.
  • gen_TreeStateMat.cpp: This is the C++ function that generates all possible row combinations satisfying a given tree structure.
  • ./data/simdata2_N500_rep1.rds, ./data/simdata2_N500_rep3.rds, ./data/simdata3.rds: Simulation truth (Z, w, rho, ...) for Simulations 2 and 3 in the manuscript. (simdata2_rep3 is for Cloe and PhyloWGS, using 600 SNVs)
  • ./data/nsim2_N500_rep1.txt, ./data/nsim2_N500_rep3.txt, ./data/nsim3.txt: Simulated (vectorized) read counts for Simulations 2 and 3 in the manuscript.
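
The vectorized layout described for input_file can be illustrated in R. The following is a minimal sketch, not code from the package: the dimensions, the placeholder counts, the temporary file name and the choice of writing one value per line are all assumptions made for illustration, so check how TreeClone_main.R reads the file before relying on this layout. It shows only that c() flattens a T*K*G array in column-major order, with the sample index varying fastest.

```r
# Hypothetical dimensions: 2 samples, 3 mutation pairs, 4 read categories.
# Long names are used because T is also an alias for TRUE in R.
n_samples <- 2     # T in the text
n_pairs <- 3       # K in the text
n_categories <- 4  # G in the text

# Placeholder counts 1..24, chosen only to make the ordering visible.
counts <- array(seq_len(n_samples * n_pairs * n_categories),
                dim = c(n_samples, n_pairs, n_categories))

# c() flattens an array in column-major order (first index varies fastest).
counts_vec <- c(counts)

# Entry (i_t, i_k, i_g) therefore sits at position
# i_t + (i_k - 1) * T + (i_g - 1) * T * K of the vector.
i_t <- 2; i_k <- 3; i_g <- 4
pos <- i_t + (i_k - 1) * n_samples + (i_g - 1) * n_samples * n_pairs
stopifnot(counts_vec[pos] == counts[i_t, i_k, i_g])

# Assumed file format: one value per line in a plain text file.
out_file <- tempfile(fileext = ".txt")
write(counts_vec, file = out_file, ncolumns = 1)

# Reading the values back and restoring the dimensions recovers the array.
restored <- array(scan(out_file, quiet = TRUE), dim = dim(counts))
stopifnot(all(restored == counts))
```

Keeping the dimension order (T, K, G) fixed when building the array matters, because the flattened file carries no dimension information; T and K are supplied separately on the command line.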


In order to run the example:

  • Extract the package, change to its directory, and build it with the Makefile (with the appropriate GSL header and library locations).
  • Run the following command with your own arguments; MCMC samples and point estimates will be generated.
$ Rscript TreeClone_main.R <input_file> <suffix> <T> <K>

For example, run the following in the terminal:

$ Rscript TreeClone_main.R ./data/nsim2_N500_rep1.txt sim2_N500_rep1 1 100

MCMC samples and point estimates will be saved as ./results/MCMCspls_suffix.rds and ./results/point_est_suffix.rds. For the example above, these are ./results/MCMCspls_sim2_N500_rep1.rds and ./results/point_est_sim2_N500_rep1.rds.
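
Once the run has finished, the saved files can be loaded back into R for inspection. The sketch below is an assumption-laden illustration rather than part of the package: inspect_result is a hypothetical helper, and because the internal layout of the saved objects is not described here, it only prints a top-level summary with str().

```r
# Hypothetical helper: load an .rds file and summarize its top-level structure.
# Returns the loaded object invisibly, or NULL if the file does not exist.
inspect_result <- function(path) {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(invisible(NULL))
  }
  obj <- readRDS(path)
  str(obj, max.level = 1)  # show only the top-level components
  invisible(obj)
}

# File names follow the pattern given above for the example run.
suffix <- "sim2_N500_rep1"
mcmc_samples <- inspect_result(
  file.path("results", paste0("MCMCspls_", suffix, ".rds")))
point_est <- inspect_result(
  file.path("results", paste0("point_est_", suffix, ".rds")))
```

If the example has not been run yet, both calls report the missing file and return NULL instead of stopping with an error.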