http://mailman.disco.unimib.it/mailman/listinfo/tronco

Which problem do I solve with TRONCO?

Despite the increasing availability of multiple omics data, the identification of explanatory models of how the (epi)genomic events are choreographed in cancer initiation and development still poses various theoretical and technical hurdles, mostly related to the dramatic heterogeneity and temporality of the disease.  

The problem of extracting such models from genomic data can be formulated in two ways: one can look for evolutionary trends characteristic of a population (i.e., ensemble-level), or for clonal progression in a single patient. The two problems require different input data to understand the temporal ordering of somatic lesions accumulating during cancer evolution.

In the former case, one usually examines m lesions in the genomes of n cross-sectional, independent patients and extracts a probabilistic graphical model of the temporal ordering of fixation and accumulation of such alterations in the input cohort. Sample size and tumor heterogeneity make the problem of extracting population-level trends hard to solve, as this requires accounting for patient-specific features such as multiple starting events.

For an individual tumor, the clonal phylogeny and prevalence are usually inferred from multiple biopsies or single-cell sequencing data. Phylogenetic-tree reconstruction from an underlying statistical model of read coverage or depth estimates the prevalence of alterations in each clone, as well as ancestry relations. This problem is made difficult by the high levels of intra-tumor heterogeneity and by sequencing issues.

In TRONCO we unify the two approaches and propose algorithms that can process data usually used in the domain of ensembles, i.e., the list of lesions per sample, to solve both types of problems.


Which kind of data can I use with TRONCO? 

The algorithms currently implemented in TRONCO assume that the lesions observed in a sample are persistent across tumor evolution; this allows one to derive a measure of temporal precedence from data that do not provide any evident time signature.

Usually, data supporting this assumption are:
  • somatic mutations, hitting one or more nucleotides;
  • wider chromosomal lesions such as Copy Number Variations;
but, under the persistence hypothesis, DNA methylation, up/down-regulation, or any other relevant epigenetic state might be used as well. We deliberately use words such as "lesion" or "alteration" to let the user decide which data type and resolution are suitable for a cancer, as mutations might be further classified by their type, or chromosomal variants might be considered at different resolutions, such as that of arms or cytobands.
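
As an illustration of the kind of input discussed above, the following Python sketch turns a list of lesions per sample into a binary samples-by-events matrix, where 1 means that a persistent lesion was observed. The function name, the event labels and the toy cohort are all hypothetical, and this is not TRONCO's own loader; it only shows the shape of the data.

```python
def lesions_to_matrix(lesions_per_sample):
    """Turn {sample: set of lesion labels} into (samples, events, 0/1 rows)."""
    samples = sorted(lesions_per_sample)
    # Collect every lesion label observed in at least one sample.
    events = sorted({e for lesions in lesions_per_sample.values() for e in lesions})
    # One row per sample, one column per event; 1 marks an observed lesion.
    rows = [[1 if e in lesions_per_sample[s] else 0 for e in events] for s in samples]
    return samples, events, rows


# Hypothetical cohort: each label combines a gene and a lesion type.
cohort = {
    "patient_1": {"KRAS:mutation", "TP53:deletion"},
    "patient_2": {"KRAS:mutation"},
    "patient_3": {"TP53:deletion", "APC:mutation"},
}

samples, events, matrix = lesions_to_matrix(cohort)
print(events)
for sample, row in zip(samples, matrix):
    print(sample, row)
```

The resolution of the labels (gene, mutation type, arm, cytoband) is a modelling choice left to the user, as explained above.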


Single-sample (bulk), multi-region (bulk) or single-cell?

The type of data available usually determines the kind of problem one faces. Single-sample bulk data (e.g., The Cancer Genome Atlas data) are suited to ensemble-level inference, namely the inference of a progression model from a population of independent cancer patients. Multi-region bulk or single-cell data, instead, can be used to study tumor evolution in individual patients.



Does TRONCO support some common data formats? 

Since version 2.0, TRONCO implements some functions to easily manipulate common data formats such as:
as well as functions to query the Cbio portal, which mirrors data from The Cancer Genome Atlas. TRONCO also provides a function to visualize the data being processed, with a layout similar to the oncoprint available in Cbio.



Should I pre-process my data before using TRONCO?

Extracting progression models can be very hard, especially at the ensemble-level. In fact, this requires efficient strategies to detect alternative routes to selective advantage across a population of tumors, since equivalent fitness may be conferred to cancer cells by distinct, and unknown, genetic alterations.

An immediate consequence of this and other states of affairs is the dramatic heterogeneity of cancer, a fact that can lead to inferring wrong models. For this reason, we have developed PiCnIc, a pipeline to pre-process ensemble-level data and diminish the confounding effects due to common forms of heterogeneity.


What is PiCnIc, the Pipeline for Cancer Inference?


See PiCnIc's webpage.


What is the CAPRI algorithm?


The CAncer PRogression Inference algorithm, CAPRI, infers directed acyclic graph (DAG) progression models from a list of annotated mutations/CNAs/... per sample. It is up to you to decide at which resolution you want to model the genomic lesions that will appear in the model.

CAPRI reconstructs a probabilistic progression model by inferring “selectivity relations”, where an alteration in a gene A “selects” for a later mutation in a gene B, which is displayed as an edge from A to B.


Each color represents a type of alteration, which is assumed to be persistent. These relations are collected in a DAG and resemble a mode of “selective advantage” (A selecting for B) in a clonal competition scenario: an A-mutant clone enjoys a clonal expansion, and the next wave of clonal expansion selects an A,B-mutant subclone.

Relations are estimated by imposing causality conditions estimated from data, leading to a model that we call Suppes-Bayes Causal Network:
  1. A is estimated to be earlier than B, i.e., a temporal relation of precedence is imputed;
  2. the presence of the earlier genomic alteration (i.e., the upstream event, A) increases the probability of observing the latter (i.e., the downstream event, B). This is called “probability raising”, A raises the probability of B.
Assessment of p-values for conditions 1 and 2 is performed within a non-parametric bootstrap framework. If both conditions hold, an edge is included in a preliminary model. This model is then considered for further refinement of the inference, and spurious relations are filtered with a standard Bayesian model-selection approach (i.e., regularisation), leading to the final graphical model.
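
The two conditions can be illustrated with point estimates on a toy binary matrix. In the Python sketch below we assume that temporal precedence is read from marginal frequencies (A is earlier than B when A is observed more often), and we omit the bootstrap and the model-selection step entirely; the function name and the data are hypothetical, so take it as a reading aid rather than as CAPRI's implementation.

```python
def suppes_conditions(matrix, a, b):
    """Return (temporal_priority, probability_raising) for the candidate edge a -> b.

    matrix is a list of 0/1 rows (one per sample); a and b are column indices.
    """
    n = len(matrix)
    with_a = [row for row in matrix if row[a] == 1]
    without_a = [row for row in matrix if row[a] == 0]
    if not with_a or not without_a:
        # P(b | a) or P(b | not a) is undefined: no evidence either way.
        return False, False

    p_a = len(with_a) / n
    p_b = sum(row[b] for row in matrix) / n
    p_b_given_a = sum(row[b] for row in with_a) / len(with_a)
    p_b_given_not_a = sum(row[b] for row in without_a) / len(without_a)

    # Condition 1 (assumption: precedence estimated from marginal frequencies).
    temporal_priority = p_a > p_b
    # Condition 2: observing a raises the probability of observing b.
    probability_raising = p_b_given_a > p_b_given_not_a
    return temporal_priority, probability_raising


# Toy cohort; columns are A and B.
toy = [
    [1, 1],
    [1, 1],
    [1, 0],
    [1, 0],
    [0, 0],
    [0, 0],
]

print("A -> B:", suppes_conditions(toy, 0, 1))
print("B -> A:", suppes_conditions(toy, 1, 0))
```

On this toy data probability raising holds in both directions, and only the precedence condition tells the two candidate edges apart.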

CAPRI's formulas. One of the peculiarities of the algorithm is its ability to test/select complex structures (e.g., to detect fitness-equivalent evolutionary trajectories). These can be given as input in the form of logical formulas that describe non-linear relations involving more than one alteration at a time. For example,
  • B:homozygous-loss xor B:snv
    • that reads as "homozygous deletions and single nucleotide variants in gene B are strictly exclusive", as the xor stands for hard mutual exclusivity;
Formulas are visualized with a network structure, where the square node represents the logical connective, and the colors represent the different alterations in the formula. For the example above, this is the network structure that is created within CAPRI.


Such formulas can be created by exploiting prior knowledge of alternate lesions (e.g. KRAS/NRAS), and automatic computational tools; see PiCnIc. Every such structure is statistically tested, in a solid statistical framework, and included in the final model only if it really contributes to better explaining the data. CAPRI aims at minimizing overfitting of the model.

When a formula appears in a model, it is connected (upstream or downstream) to other nodes. The semantics of the formula allows one to model events such as branched or confluent evolutionary trajectories, according to the type of connectives that you use. When we want to extract progression models from an ensemble of heterogeneous cross-sectional samples, using formulas is a smart way to test hypotheses about complex evolutionary trajectories that capture the individual specificities of each potential cancer patient. In the example above, the model would depict distinct evolutionary trajectories, i.e., some tumors/patients will be likely to evolve through homozygous deletions of B, others through single nucleotide variants in the same gene, the occurrence of both being due to chance (besides being impossible, in this particular case).
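
To make the xor example concrete, the Python sketch below evaluates the formula sample by sample, producing the 0/1 value that a formula node would take in each sample. The helper name and the toy samples are hypothetical, and this is only our reading of hard exclusivity for two operands, not the code used inside CAPRI.

```python
def xor_formula(sample, operands):
    """Hard mutual exclusivity: 1 when exactly one operand is present, else 0."""
    present = sum(sample[name] for name in operands)
    return 1 if present == 1 else 0


formula = ("B:homozygous-loss", "B:snv")

# Toy samples: 1 means the alteration was observed.
toy_samples = [
    {"B:homozygous-loss": 1, "B:snv": 0},  # deletion only
    {"B:homozygous-loss": 0, "B:snv": 1},  # SNV only
    {"B:homozygous-loss": 0, "B:snv": 0},  # wild-type B
    {"B:homozygous-loss": 1, "B:snv": 1},  # both: violates hard exclusivity
]

formula_values = [xor_formula(s, formula) for s in toy_samples]
print(formula_values)
```

The resulting column can then be treated like any other event when testing whether the formula, as a whole, is selected by an upstream alteration or selects a downstream one.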

This is a graphical interpretation of common examples.
  •  branched evolutionary trajectories where A-mutant clones will enjoy expansion with a further wave of clonal selection once they lose wild-type B, either because of copy number alteration or a mutation;
  • confluent evolutionary trajectories where A-mutant and B-mutant clones are predicted to eventually progress through alterations in C, and finally D.
Arbitrary formulas of this kind can be spelled out and tested, as we discuss in detail in the PiCnIc pipeline.

In the main CAPRI paper, we have shown that the algorithm performs well even with a relatively small number of samples, with improved performance compared to existing methods.


What is the CAPRESE algorithm?

This algorithm is the earliest contribution of our team to the inference of tree-based progression models, hence its name: CAncer PRogression Extraction with Single Edges (CAPRESE). Tree-based models can capture phenomena such as trunk and branched evolution, but are not suitable to capture convergent evolutionary trajectories, as is possible with CAPRI, because the underlying model is assumed to be a tree, or a forest of trees.

CAPRESE is derived by considering a general reconstruction setting complicated by the presence of noise in the data due to biological variation, as well as experimental or measurement errors. To improve tolerance to noise, the algorithm uses a shrinkage-like statistical estimator, and thus enjoys correctness properties such as asymptotic convergence to the correct tree under mild constraints on the level of noise.
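
The exact estimator is defined in the CAPRESE paper. The Python sketch below only illustrates the general idea of a shrinkage-like score for a candidate edge a -> b: a normalized probability-raising term is blended with a correlation-like correction through a coefficient lam. The specific formula, the function name and the default lam = 0.5 are assumptions made for illustration.

```python
def shrinkage_score(p_a, p_b, p_ab, lam=0.5):
    """Illustrative shrinkage-like score for the candidate edge a -> b.

    p_a, p_b: marginal probabilities of a and b; p_ab: joint probability.
    The formula is an assumption for illustration, not CAPRESE's verified code.
    """
    if not (0 < p_a < 1) or p_b <= 0:
        raise ValueError("need 0 < p_a < 1 and p_b > 0")

    p_b_given_a = p_ab / p_a
    p_b_given_not_a = (p_b - p_ab) / (1 - p_a)

    # Normalized probability raising, in [-1, 1].
    alpha = (p_b_given_a - p_b_given_not_a) / (p_b_given_a + p_b_given_not_a)
    # Correlation-like correction, symmetric in a and b, in [-1, 1].
    beta = (p_ab - p_a * p_b) / (p_ab + p_a * p_b)

    # lam shrinks the raw probability-raising term towards the correction.
    return (1 - lam) * alpha + lam * beta


score = shrinkage_score(p_a=0.6, p_b=0.4, p_ab=0.3)
print(round(score, 4))
```

With lam = 0 the score reduces to the raw probability-raising term; larger values of lam give more weight to the correction, which is the sense in which the estimator is "shrinkage-like".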

In the paper describing it, we have shown that it is efficient (and better than competing algorithms that make similar assumptions on the underlying progression model), even with a relatively small number of samples. CAPRESE's performance quickly converges to its asymptote as the number of samples increases.


Can I propose my algorithm to become part of the package?

Yes, TRONCO is engineered to be flexible and welcomes multiple different algorithms. Of course, your algorithm should have a scope similar to that of the ones currently implemented in the tool, and you might be asked to help us code it.

Feel free to get in touch with us to discuss these issues further.


How do I interface TRONCO with other tools?


TRONCO is written in R to make it easy to combine with other major bioinformatics tools, such as those archived at Bioconductor. In our case studies we often use tools to pre-process data, and TRONCO ships some functions that we use to make the daily routines of the lab easier. From these we derived our Pipeline for Cancer Inference, PiCnIc, which exploits these routines to pre-process cancer samples.


Can I fetch or process data from public repositories with TRONCO? 

Of course you can. We support, in different ways, a query and data-processing system compatible with The Cancer Genome Atlas (TCGA) and the Cbio portal for Cancer Genomics.
You cannot fetch data from TCGA automatically. However, we provide some functions that help to: extract custom clinical data; detect the presence of multiple samples from the same patient (from the barcode formats); remove multiple samples by using TCGA aliquot disambiguation rules; and shorten barcodes to the first 12 characters if there are no duplicates.
You can query datasets from the Cbio portal by using the TRONCO function cbio.query, which wraps the CGDS-R package implemented at Sloan-Kettering. You can also export TRONCO datasets for visualization with the portal by using the function oncoprint.cbio.
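
The TCGA barcode handling mentioned above can be pictured with the following Python sketch. It relies on the first 12 characters of a TCGA barcode identifying the patient; the function names and the barcodes are hypothetical and do not reproduce TRONCO's own functions or the aliquot disambiguation rules.

```python
from collections import defaultdict


def patient_id(barcode):
    """First 12 characters of a TCGA barcode, e.g. 'TCGA-AA-0001'."""
    return barcode[:12]


def samples_per_patient(barcodes):
    """Group full barcodes by patient, to spot patients with several samples."""
    groups = defaultdict(list)
    for barcode in barcodes:
        groups[patient_id(barcode)].append(barcode)
    return dict(groups)


def shorten_barcodes(barcodes):
    """Shorten barcodes to 12 characters, refusing to do so if that creates duplicates."""
    groups = samples_per_patient(barcodes)
    if any(len(group) > 1 for group in groups.values()):
        raise ValueError("multiple samples for the same patient; resolve them first")
    return [patient_id(barcode) for barcode in barcodes]


# Hypothetical barcodes; the last two belong to the same patient.
barcodes = [
    "TCGA-AA-0001-01A-01D-0001-01",
    "TCGA-AA-0002-01A-01D-0001-01",
    "TCGA-AA-0002-01B-01D-0001-01",
]

duplicated = {p: b for p, b in samples_per_patient(barcodes).items() if len(b) > 1}
print(duplicated)
print(shorten_barcodes(barcodes[:2]))
```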