Supporting web page for the paper 'Patient-specific data fusion defines prognostic cancer subtypes'

Yinyin Yuan*, Richard Savage*, Florian Markowetz

Breast cancer data set: download from the bottom of this page.
Prostate cancer data set: download from Sawyers et al. 2010.

Matlab Code
The code is contained in the file patientSpecificDataFusion.tar.gz, which can be downloaded from this page.
There's a README file in the main directory of the tar file which should get you started.

This program is free software: you can redistribute it and/or modify it under the terms 
of the GNU General Public License as published by the Free Software Foundation, either 
version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; 
without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. 
If not, see


1.    Data preprocessing

The processing and selection of features for input to PSDF depend crucially on the resolution of the microarray data. We propose three different approaches: probe-based, region-based and gene-centric. With lower-resolution microarrays, the probes can be used directly as features. If copy number data are found to be highly correlated, the probe-based features can be merged to generate region-based copy number features. High-resolution, genome-wide data can be collapsed to a gene-centric scaffold, based on genome annotation. For all approaches, concomitant analysis can then be applied to extract coherent copy number and expression features.

1.1. Feature preprocessing.

1.1.1. Probe-based features: directly use the probe-based data from microarrays for both copy number and expression data.

1.1.2. Region-based features (used in the breast cancer data set analysis): copy number data are merged based on their similarity using the function mergeCN in the R package DANCE, which is a wrapper for the function CGHregions in the package CGHregions. We set the parameter averror to 0.05. The output is the genomic locations of the merged regions. Copy number data for these regions can then be extracted with the function getDataRegions in DANCE to summarise the copy number changes within them.

1.1.3. Gene-centric features (used in the prostate cancer data set analysis): both copy number and expression data are summarised for each gene by taking the median value across the probes in that gene.
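As an illustration of the gene-centric summary in 1.1.3, the following is a minimal Python sketch, not code from the paper. The function name collapse_to_genes, the probe ids, the gene labels and the values are all hypothetical; the only step taken from the text is the per-gene median over probes.

```python
from collections import defaultdict
from statistics import median

def collapse_to_genes(probe_values, probe_to_gene):
    """Summarise probe-level data per gene by the median across probes.

    probe_values:  dict probe id -> list of values, one per sample
    probe_to_gene: dict probe id -> gene label (hypothetical annotation)
    """
    by_gene = defaultdict(list)
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # skip probes without a gene annotation
            by_gene[gene].append(values)
    # zip(*rows) iterates over samples; the median is taken over probes
    return {gene: [median(col) for col in zip(*rows)]
            for gene, rows in by_gene.items()}

# Made-up example: three probes map to GENE_A, one to GENE_B, two samples
probes = {"p1": [0.1, 0.5], "p2": [0.3, 0.7], "p3": [0.9, -0.2], "p4": [1.0, 2.0]}
annotation = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_A", "p4": "GENE_B"}
gene_data = collapse_to_genes(probes, annotation)
print(gene_data)
```

The same function can be applied separately to the copy number and the expression matrices, so that both end up on a common gene-level scaffold.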

1.2. Feature selection. Given any type of features from step 1.1, concomitant copy number and expression features can be extracted using the function getSignatures in DANCE. This function pairs copy number and expression features and passes them through binomial tests, so that each pair is assigned an adjusted p-value. In the paper we used p<0.1 as the threshold to select the most significant features.
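The test-and-threshold idea in 1.2 can be pictured with the Python sketch below. It is not the getSignatures implementation: counting samples whose discretised copy number and expression states agree, the chance probability of 1/3, and the Benjamini-Hochberg adjustment are all assumptions made for illustration, as are the function names and the toy data. Only the use of binomial tests, adjusted p-values and the 0.1 threshold comes from the text.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def select_concordant(cn, expr, chance=1 / 3, threshold=0.1):
    """cn, expr: dict feature -> list of discretised states (1, 2, 3) per sample.

    Assumption: a pair is 'concomitant' if the states agree in more samples
    than expected by chance (one-sided binomial test).
    """
    names = sorted(cn)
    pvals = []
    for name in names:
        agree = sum(a == b for a, b in zip(cn[name], expr[name]))
        pvals.append(binom_upper_tail(agree, len(cn[name]), chance))
    adjusted = benjamini_hochberg(pvals)
    return {name: q for name, q in zip(names, adjusted) if q < threshold}

# Made-up data: f1 agrees in every sample, f2 in none
cn = {"f1": [1, 1, 3, 3, 2, 1, 3, 3, 1, 2], "f2": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1]}
expr = {"f1": [1, 1, 3, 3, 2, 1, 3, 3, 1, 2], "f2": [2, 3, 1, 2, 3, 1, 2, 3, 1, 2]}
selected = select_concordant(cn, expr)
print(selected)
```

In this toy example only f1 passes the threshold, since f2 never agrees between the two data types.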

2.    Clustering

2.1. Discretised data: Copy number calls can be made using the R package CGHcall. The getSignatures function in DANCE automatically generates discretised expression data. Both data types are thus discretised into three levels, represented numerically as 1, 2 and 3.
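To make the three-level encoding concrete, here is a small Python sketch. It does not reproduce CGHcall or getSignatures, which use their own calling procedures. The fixed cut-offs of -0.5 and 0.5, and the reading of 1, 2, 3 as loss/down, neutral and gain/up, are assumptions chosen only to show the shape of the input that PSDF receives.

```python
def discretise(values, low=-0.5, high=0.5):
    """Map continuous values to three levels.

    Assumed encoding: 1 = below `low`, 2 = between the cut-offs, 3 = above `high`.
    The cut-offs are hypothetical and not taken from the paper.
    """
    return [1 if v < low else 3 if v > high else 2 for v in values]

# Made-up log-ratios for five samples
levels = discretise([-1.2, 0.0, 0.8, 0.5, -0.5])
print(levels)
```

Values that fall exactly on a cut-off are assigned to the middle level in this sketch.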

2.2. PSDF clustering: PSDF is implemented in Matlab. For each input data type, a wrapper function is created to read in the data. The PSDF analysis is then performed, with each run producing a single MCMC chain. In order to achieve well-mixed results, 50 chains each of 10^5 samples in length are run. These are run in parallel, on a multi-node computer cluster.

Once all the MCMC chains have been produced, they are read into R using the CODA package.  From these, the posterior similarity matrix is estimated, giving the pairwise probability of each pair of items being in the same cluster and fusion state.  The posterior probability of each item being fused, P(fusion), is also estimated, as is the probability of each input feature being informative, P(biomarker).  Consensus clustering partitions are then extracted using the R package MCCLUST.  Partitions are extracted for the Fused case (P(fusion)>=0.5), Unfused (P(fusion)<0.5) and All samples.    
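The post-processing quantities described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the R code used for the paper: each MCMC sample is represented here as a list of hypothetical (cluster label, fused flag) pairs, one per item, and the consensus partition step performed with MCCLUST is not reproduced.

```python
def posterior_similarity(samples):
    """Fraction of MCMC samples in which items i and j share both
    cluster label and fusion state.

    samples: list of MCMC samples; each sample is a list of
             (cluster_label, fused_flag) pairs, one per item (assumed layout).
    """
    n_items = len(samples[0])
    counts = [[0] * n_items for _ in range(n_items)]
    for state in samples:
        for i in range(n_items):
            for j in range(n_items):
                if state[i] == state[j]:
                    counts[i][j] += 1
    return [[c / len(samples) for c in row] for row in counts]

def fusion_probability(samples):
    """Per-item P(fusion): fraction of samples in which the item is fused."""
    return [sum(fused for _, fused in item) / len(samples)
            for item in zip(*samples)]

# Four made-up MCMC samples for three items; label values are arbitrary,
# only equality within a sample matters
samples = [
    [(0, 1), (0, 1), (1, 0)],
    [(0, 1), (0, 1), (0, 1)],
    [(0, 1), (1, 0), (1, 0)],
    [(2, 1), (2, 1), (1, 0)],
]
psm = posterior_similarity(samples)
p_fusion = fusion_probability(samples)
fused_items = [i for i, p in enumerate(p_fusion) if p >= 0.5]
unfused_items = [i for i, p in enumerate(p_fusion) if p < 0.5]
print(psm, p_fusion, fused_items, unfused_items)
```

Because the similarity matrix depends only on whether two items share a state within a sample, it is unaffected by cluster labels being permuted between samples or chains.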

3.    Interpretation
For fused clusters, unfused clusters, or all samples, the following analyses are applied.

3.1. Extract subtype-specific features: For each cluster, features can be ranked by comparing the data of the cluster with the rest of the samples. Here features are not limited to those used in the clustering.

for all clusters
    for all copy number and expression features
        run lmFit function from R package Limma
        get associated p-value and log fold change from the Limma result
        if a feature has p<0.1 and log fold change>0.2
            the feature is termed subtype-specific
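The filtering step at the end of the loop above can be sketched in Python as follows. The sketch assumes that the Limma p-values and log fold changes have already been exported, and it applies the two thresholds exactly as written in the text (no absolute value on the fold change). The function name, the feature names and the numbers are made up for illustration.

```python
def subtype_specific(results, p_max=0.1, lfc_min=0.2):
    """Return the features passing both thresholds, ordered by p-value.

    results: dict feature -> (p_value, log_fold_change) for one cluster,
             e.g. exported from a Limma fit (assumed input format).
    """
    passing = [name for name, (p, lfc) in results.items()
               if p < p_max and lfc > lfc_min]
    return sorted(passing, key=lambda name: results[name][0])

# Made-up results for two clusters
limma_results = {
    "cluster1": {"featA": (0.001, 1.4), "featB": (0.03, -0.9),
                 "featC": (0.20, 0.6), "featD": (0.05, 0.25)},
    "cluster2": {"featA": (0.5, 0.1), "featB": (0.01, 0.8)},
}
specific = {cluster: subtype_specific(res) for cluster, res in limma_results.items()}
print(specific)
```

Sorting by p-value gives the ranking that later steps, such as taking the top features for enrichment analysis, can work from.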

3.2. Subtype-specific network modules: Given a PPI network and, for each gene, its specificity to a subtype, the R package BioNet can be used to extract subtype-specific network modules. We defined the specificity using the p-values directly from Limma.

Get PPI network from HPRD

for all clusters
    run BioNet using the subtype-specific features for this cluster, the Limma p-values associated with these features, and the PPI network
    plot the module output from BioNet

3.3. Subtype-specific enrichment pathways: With the subtype-specific features, enrichment analysis can be applied to uncover associated pathways for each subtype.

for all clusters
    for the top 800 subtype-specific features
        perform the enrichment analysis using the function analyze in R package HTSanalyzeR and database package KEGG.db
    KEGG pathways with adjusted p<0.01 are allowed into the enrichment map using the function viewEnrichMap in HTSanalyzeR
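For readers unfamiliar with pathway enrichment, the Python sketch below shows one common form of the test, a hypergeometric over-representation test for a single pathway. It is an assumption that this matches what was run through HTSanalyzeR for the paper; the function names, the gene identifiers and the pathway are made up, and multiple-testing adjustment across pathways is left out.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k), where X counts pathway genes among n selected genes,
    with K pathway genes in a universe of N genes."""
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, min(K, n) + 1))
    return favourable / total

def enrichment_p(hits, pathway, universe):
    """Over-representation p-value of `pathway` among `hits`.

    Genes outside the universe are ignored on both sides.
    """
    universe = set(universe)
    hits = set(hits) & universe
    pathway = set(pathway) & universe
    return hypergeom_upper_tail(len(hits & pathway), len(pathway),
                                len(hits), len(universe))

# Made-up example: 20 genes, a 5-gene pathway, 4 selected genes, 3 in the pathway
universe = {f"g{i}" for i in range(20)}
pathway = {"g0", "g1", "g2", "g3", "g4"}
hits = {"g0", "g1", "g2", "g10"}
p_value = enrichment_p(hits, pathway, universe)
print(p_value)
```

In a full analysis one such p-value would be computed per pathway and per cluster, adjusted for multiple testing, and compared against the 0.01 cut-off mentioned above.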


Rich Savage,
Aug 18, 2011, 12:39 AM
Rich Savage,
Aug 17, 2011, 1:57 AM