The CODATA-RDA Research Data Science Applied workshop on Bioinformatics
24-28 July 2017 - International Centre for Theoretical Physics (ICTP), Trieste, Italy.
Slides are available at the bottom of the page.
Part 1: Introduction: Cancer Evolution
Part 2: Learning Cancer Progression Models from single-sample bulk sequencing data.
Part 3: Determining repeated evolution in cancer from multi-region sequencing datasets using transfer learning.
Detecting repeated cancer evolution in human tumours from multi-region sequencing data. Giulio Caravagna, Ylenia Giarratano, Daniele Ramazzotti, Trevor A Graham, Guido Sanguinetti, Andrea Sottoriva. Preprint.
On learning the structure of Bayesian Networks and submodular function maximization. Giulio Caravagna, Daniele Ramazzotti, Guido Sanguinetti. Preprint.
Algorithmic methods to infer the evolutionary trajectories in cancer progression. G.Caravagna, A.Graudenzi, D.Ramazzotti, R.Sanz-Pamplona, L.De Sano, G.Mauri, V.Moreno, M.Antoniotti, B.Mishra. PNAS 113 (28), E4025–E4034, 2016.
TRONCO: an R package for the inference of cancer progression models from heterogeneous genomic data. L.De Sano, G.Caravagna, D.Ramazzotti, A.Graudenzi, G.Mauri, B.Mishra, M.Antoniotti. Bioinformatics 32, 1911-1913, 2016.
CAPRI: efficient inference of cancer progression models from cross-sectional data. D.Ramazzotti, G.Caravagna, L.Olde Loohuis, A.Graudenzi, I.Korsunsky, G.Mauri, M.Antoniotti, B.Mishra. Bioinformatics 31(18), 3016-3026, 2015.
Inferring tree causal models of cancer progression with a shrinkage estimator and probability raising. L.Olde Loohuis, G.Caravagna, A.Graudenzi, D.Ramazzotti, G.Mauri, M.Antoniotti, B.Mishra. PLoS ONE, 9(10):e108358, 2014.
We will see how the tools that we discussed work and, by the end of the tutorials, you should be able to run some analyses on your own. There are exercises that you can keep working on during the school, and I'll be around for questions and comments. All referenced files are attached at the bottom of this page.
We present the cBioPortal for Cancer Genomics, hosted by the Memorial Sloan Kettering Cancer Center. cBio is a repository that gives access to mostly high-quality single-sample bulk sequencing data of cancer genomes (e.g., data collected by The Cancer Genome Atlas or Genomic Data Commons projects). I will show how to access data through the portal's web interface, and how to perform some standard queries for data visualisation (e.g., oncoprints). We will see later how the same can be done in R.
Cerami et al. The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data. Cancer Discovery 2(5), 401-404, 2012.
I will briefly introduce you to Bioconductor, from which we download the TRONCO tool for TRanslational ONCOlogy. I will show some simple manipulation of a small dataset of atypical Chronic Myeloid Leukaemia, and will introduce the main functions implemented in the package. Then I will show how to automatically query data from the cBioPortal with TRONCO.
Bioconductor support page.
Sources: TBA.
Exercise: select one cancer type from cBio, download its mutation data and use TRONCO to visualize it. Then, generate a model via CAPRI and plot it.
You can either use the automatic download function or download data manually.
You should try to create some hypotheses by looking at the "Mutual exclusivity and co-occurrence" panel of cBio, and include them in the inference process.
Genes list
genes = c("APC", "CTNNB1", "DKK1", "DKK2", "DKK3", "DKK4", "LRP5", "FZD10", "FAM123B", "AXIN2", "TCF7L2", "FBXW7", "ARID1A", "SOX9", "ERBB2", "ERBB3", "NRAS", "KRAS", "BRAF", "IGF2", "IRS2", "PIK3CA", "PIK3R1", "PTEN", "TGFBR1", "TGFBR2", "ACVR1B", "ACVR2A", "SMAD2", "SMAD3", "SMAD4", "TP53", "ATM")
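As a rough illustration of what the "Mutual exclusivity and co-occurrence" hint is about, the sketch below scores one pair of genes from a table of per-sample mutations. It is written in plain Python to stay independent of the TRONCO API, uses invented toy data, and the helper names, the 0.5 pseudo-count and the one-sided Fisher test are assumptions made for this example; the statistics reported by cBio may be computed differently.

```python
from math import comb, log2

def contingency(profiles, gene_a, gene_b):
    """Count samples by joint mutation status of two genes.

    profiles maps a sample id to the set of genes mutated in that sample.
    Returns (both, only_a, only_b, neither).
    """
    both = only_a = only_b = neither = 0
    for mutated in profiles.values():
        a, b = gene_a in mutated, gene_b in mutated
        if a and b:
            both += 1
        elif a:
            only_a += 1
        elif b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither

def log2_odds_ratio(both, only_a, only_b, neither):
    """log2 odds ratio; the 0.5 pseudo-count per cell is an assumption
    made here to avoid dividing by zero. Negative values suggest exclusivity."""
    return log2(((both + 0.5) * (neither + 0.5)) / ((only_a + 0.5) * (only_b + 0.5)))

def fisher_exclusivity_p(both, only_a, only_b, neither):
    """One-sided Fisher exact p-value: probability of seeing this few (or
    fewer) co-mutated samples if the two genes were mutated independently."""
    n = both + only_a + only_b + neither
    n_a, n_b = both + only_a, both + only_b
    lowest = max(0, n_a + n_b - n)  # smallest overlap the margins allow
    tail = sum(comb(n_a, k) * comb(n - n_a, n_b - k) for k in range(lowest, both + 1))
    return tail / comb(n, n_b)

# Toy cohort (invented): KRAS and NRAS never co-occur in these ten samples.
profiles = {
    "s1": {"KRAS", "TP53"}, "s2": {"KRAS"}, "s3": {"KRAS"}, "s4": {"KRAS"},
    "s5": {"KRAS"}, "s6": {"NRAS"}, "s7": {"NRAS"}, "s8": {"NRAS", "TP53"},
    "s9": {"NRAS"}, "s10": set(),
}
table = contingency(profiles, "KRAS", "NRAS")
log_or = log2_odds_ratio(*table)
p_value = fisher_exclusivity_p(*table)
print(table, round(log_or, 2), round(p_value, 4))
```

A pair with a negative log odds ratio and a small p-value is a natural candidate for a hypothesis to add before running the inference.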
Exercise: for the cancer type that you selected and whose mutation data you have downloaded, now also download copy number data and use both data types to create a model.
Copy Number is available at cBio in the GISTIC format.
GISTIC data can be imported into TRONCO very easily (see function import.GISTIC). Once you have loaded CNA data and mutations, you can merge the two TRONCO datasets.
We will analyze high-quality colon and rectum adenocarcinoma data released by The Cancer Genome Atlas within the COADREAD project. We will do that by using the Pipeline for Cancer Inference (PiCnIc), and will replicate the main analysis carried out in the original PiCnIc paper.
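To make the idea of merging concrete, here is a minimal conceptual sketch in plain Python; it is not how TRONCO implements the merge. It assumes GISTIC calls coded as -2, -1, 0, 1, 2 (the usual thresholded convention), keeps only the extreme calls, restricts both datasets to their shared samples, and joins the events. The function names, event labels and toy data are all invented for this example.

```python
# Illustrative labels for thresholded GISTIC calls (0 means no change).
GISTIC_LABELS = {
    -2: "Homozygous Loss",
    -1: "Heterozygous Loss",
    1: "Low-level Gain",
    2: "High-level Gain",
}

def cna_events(gistic, keep=(-2, 2)):
    """Turn {sample: {gene: code}} into {sample: {(gene, label), ...}},
    keeping only the codes listed in `keep` (by default the extreme calls)."""
    return {
        sample: {(gene, GISTIC_LABELS[code]) for gene, code in calls.items() if code in keep}
        for sample, calls in gistic.items()
    }

def merge_datasets(mutations, cna):
    """Keep the samples present in both datasets and join their events.

    mutations maps a sample to the set of mutated genes; cna is the output
    of cna_events. Each event becomes a (gene, type) pair.
    """
    shared = sorted(set(mutations) & set(cna))
    return {s: {(g, "Mutation") for g in mutations[s]} | cna[s] for s in shared}

# Toy data (invented): s3 lacks CNA calls and s4 lacks mutation calls.
mutations = {"s1": {"APC", "KRAS"}, "s2": {"TP53"}, "s3": {"APC"}}
gistic = {
    "s1": {"ERBB2": 2, "PTEN": 0},
    "s2": {"ERBB2": 1, "PTEN": -2},
    "s4": {"ERBB2": 2, "PTEN": -2},
}
merged = merge_datasets(mutations, cna_events(gistic))
for sample, events in merged.items():
    print(sample, sorted(events))
```

Restricting to shared samples matters because a sample without CNA calls would otherwise look as if it had no copy number events at all.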
Comprehensive molecular characterization of human colon and rectal cancer. The Cancer Genome Atlas Network. Nature 487.7407 (2012): 330.
Sources: available from GitHub.
Exercise: in the associated paper, TCGA carries out sample stratification according to different types of data (expression, methylation, etc.). Replicate the analysis above by using one of the stratifications discussed by TCGA. For instance, instead of stratifying the cohort by clinical MSS/MSI-HIGH status, use the classification CIN, MSI-CIMP, Invasive obtained from expression data. For simplicity, you can analyze just one of the possible subtypes (e.g., Invasive).
Cluster assignments computed by TCGA are available in the file 2011-11-14592C-SUP-TABLE-1.CSV.
We will analyze data with REVOLVER, and replicate some of the plots shown in the main paper. I will show you the analysis of TRACERx samples (n=100, lung cancer), and will also provide you with data for breast (n=50) and renal (n=10) tumors.
Tracking the evolution of non–small-cell lung cancer. M Jamal-Hanjani et al., New England Journal of Medicine 376.22 (2017): 2109-2121.
Subclonal diversification of primary breast cancer revealed by multiregion sequencing. LR Yates et al., Nature Medicine 21.7 (2015): 751-759.
Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. M Gerlinger et al., New England Journal of Medicine 366.10 (2012): 883-892.
Sources: TBA.
Exercise 1: run the REVOLVER analysis (training set, test set) on breast cancer data, in a similar way to the one that we have implemented for TRACERx.
Exercise 2: run the REVOLVER analysis (training set, test set) on kidney cancer data, in a similar way to the one that we have implemented for TRACERx.
Exercise: feature selection via survival
We can think of PiCnIc/CAPRI models as feature-selection strategies. From genomic data, the features are all possible orderings of events (i.e., all possible edges), so our models select the relevant features according to their ability to explain dependencies. However, we might want to further select, among the output edges, those that are prognostic for a phenotype. Consider survival, for instance: can we define a feature-selection strategy that correlates with survival?
The answer is yes, and one way of doing it is to use edges to define groups. For an edge A -> B, you can define two groups: GA, the tumors that are A-mutated and B-wildtype, and GAB, the tumors with both A and B mutated. Then you can use Kaplan-Meier plots (and the log-rank test), which we have seen with REVOLVER, to compare survival between the groups.
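The grouping and the comparison can be sketched as follows. This is a minimal plain-Python illustration with invented toy data, not the script in the ZIP archive: the function names, the sample records and the choice of the edge APC -> KRAS are assumptions made for the example. It builds the two groups for one edge, computes a Kaplan-Meier estimate for each, and evaluates the two-group log-rank statistic.

```python
from math import erfc, sqrt

def split_by_edge(samples, a, b):
    """For an edge a -> b, return (GA, GAB): tumors with a mutated and b
    wildtype, and tumors with both a and b mutated."""
    ga = [s for s in samples if a in s["mutated"] and b not in s["mutated"]]
    gab = [s for s in samples if a in s["mutated"] and b in s["mutated"]]
    return ga, gab

def kaplan_meier(group):
    """Kaplan-Meier estimate as a list of (time, survival) at event times."""
    surv, curve = 1.0, []
    for t in sorted({s["time"] for s in group if s["event"]}):
        at_risk = sum(1 for s in group if s["time"] >= t)
        deaths = sum(1 for s in group if s["time"] == t and s["event"])
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve

def logrank_chi2(g1, g2):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    pooled = g1 + g2
    observed_minus_expected, variance = 0.0, 0.0
    for t in sorted({s["time"] for s in pooled if s["event"]}):
        n1 = sum(1 for s in g1 if s["time"] >= t)       # at risk in group 1
        n = sum(1 for s in pooled if s["time"] >= t)    # at risk overall
        d1 = sum(1 for s in g1 if s["time"] == t and s["event"])
        d = sum(1 for s in pooled if s["time"] == t and s["event"])
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return observed_minus_expected ** 2 / variance if variance > 0 else 0.0

# Toy cohort (invented): time in months, event=True means death was observed.
samples = [
    {"id": "s1", "mutated": {"APC"}, "time": 10, "event": True},
    {"id": "s2", "mutated": {"APC"}, "time": 12, "event": False},
    {"id": "s3", "mutated": {"APC", "KRAS"}, "time": 3, "event": True},
    {"id": "s4", "mutated": {"APC", "KRAS"}, "time": 5, "event": True},
    {"id": "s5", "mutated": {"KRAS"}, "time": 7, "event": True},
    {"id": "s6", "mutated": {"APC"}, "time": 8, "event": True},
]
ga, gab = split_by_edge(samples, "APC", "KRAS")
curve_ga, curve_gab = kaplan_meier(ga), kaplan_meier(gab)
chi2 = logrank_chi2(ga, gab)
p_value = erfc(sqrt(chi2 / 2))  # upper tail of a chi-square with 1 d.o.f.
print(curve_ga, curve_gab, chi2, p_value)
```

Tumors in which A is wildtype (here s5) belong to neither group, so the comparison isolates the effect of acquiring B on top of A. With real cohorts the groups can be small, and p-values from many edges would need a correction for multiple testing.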
To make it easier, download the ZIP archive EXAMPLE_SURVIVAL.ZIP, read and understand the script, and use it as the starting point for this analysis.
Exercise: clustering single-sample data
Sketch a way to use PiCnIc/CAPRI models to induce a stratification of the input cohort, as we did for REVOLVER. You can just discuss your ideas.
Exercise: changing data-resolution
Re-run the COADREAD analysis by keeping mutations distinct by type.
For variations on the standard loading of MAF data, see function import.MAF and its parameters.
Scripts require some small editing here and there, and you should avoid adding hypotheses for multiple mutations on the same gene.