Diagnosing breast cancer based on PBMC gene expression profile using Bayesian additive regression trees (BART) method

Zicheng Hu and Kaijun Lu

Molecular Genetics and Microbiology, College of Natural Sciences
Marine Science Institute, College of Natural Sciences

    In this study, we develop a method to diagnose breast cancer based on the gene expression profile of peripheral blood mononuclear cells (PBMC). The assumption is that some PBMC have migrated into the tumor and changed their gene expression profile. As a result, breast cancer patients and healthy people will have different gene expression patterns in their PBMC. Because PBMC are more accessible than tumor samples, their gene expression profile may be a good indicator of a patient's cancer status.

    Linear regression methods perform poorly on gene expression data for two reasons: the expression levels of genes are highly correlated with each other, and the relationship between cancer status and the gene expression profile is non-linear. Instead, we use the Bayesian additive regression trees (BART) method, which uses tree-like structures to capture the non-linear relationship between gene expression levels and cancer status.

    Briefly, the model is:

Yi ~ Bernoulli(pi)
probit(pi) = g1(Xi) + g2(Xi) + ... + gn(Xi) + R

    Yi is the cancer status of patient i. Yi = 1 means that patient i has cancer; otherwise, the patient is healthy.
    pi is the probability that patient i has cancer.
    g1, ..., gn are n decision trees, each of which returns a single value.
    R is the residual term, which follows a normal distribution.
    Xi is the gene expression profile of PBMC from patient i. It is a vector containing the expression level of each gene.
   More details on the BART model can be found here.
   More details on the specific BART model constructed for diagnosing cancer can be found here.
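
    The deterministic part of the model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the fitted model: the one-split trees built by make_stump, their thresholds and leaf values, and the expression vector are all made up, and the residual term R is left out so that only the sum of the trees is mapped through the inverse probit link (the standard normal CDF).

```python
import math


def normal_cdf(z):
    """Standard normal CDF, i.e. the inverse of the probit link."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def make_stump(gene_index, threshold, left_value, right_value):
    """Hypothetical one-split tree that returns a single value per input."""
    def tree(x):
        return left_value if x[gene_index] <= threshold else right_value
    return tree


def cancer_probability(trees, x):
    """Sum the outputs of all trees, then map the sum to a probability."""
    total = sum(tree(x) for tree in trees)
    return normal_cdf(total)


# Made-up trees and expression levels, for illustration only.
trees = [
    make_stump(0, 5.0, -0.4, 0.6),
    make_stump(1, 2.0, 0.3, -0.2),
    make_stump(0, 7.5, -0.1, 0.5),
]
x_i = [6.0, 1.5]  # hypothetical expression levels of two genes
p_i = cancer_probability(trees, x_i)
print(round(p_i, 3))
```

    In an actual BART fit there are many such trees, and the prediction is averaged over the MCMC draws of all trees rather than taken from one fixed set.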

    We use the default priors on the parameters of g1, ..., gn and R described in Chipman et al. (2010). We then use a set of training data containing the gene expression profiles of 35 cancer patients and 20 healthy people to update these parameters. Next, we test the performance of the method on a set of testing data, which contains the gene expression profiles of 13 cancer patients and 11 healthy people. The result is shown in Figure 2. The predicted probability of having cancer is higher in cancer patients than in healthy people. With a cut-off of 0.6, all test samples but one are classified correctly (the misclassified sample is marked with a red arrow). More details can be found here.
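
    The cut-off step can be illustrated with a short sketch. The probabilities and labels below are made up and are not the study's test data; whether a probability exactly equal to the cut-off counts as cancer is also an assumption here (the sketch uses "greater than or equal").

```python
def classify(probabilities, cutoff=0.6):
    """Label a sample as cancer (1) when its predicted probability reaches the cut-off."""
    return [1 if p >= cutoff else 0 for p in probabilities]


def count_errors(predicted, actual):
    """Count the samples whose predicted label differs from the true status."""
    return sum(1 for p, a in zip(predicted, actual) if p != a)


# Made-up predicted probabilities and true statuses, for illustration only.
predicted_prob = [0.91, 0.75, 0.55, 0.20, 0.35, 0.62]
true_status = [1, 1, 1, 0, 0, 0]

labels = classify(predicted_prob)
errors = count_errors(labels, true_status)
print(labels, errors)
```

    Raising the cut-off trades missed cancer cases for fewer false alarms, so the value 0.6 is best read as a choice made for this data set rather than a general rule.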

Figure 2

    BART is an MCMC-based algorithm that automatically chooses the "important" genes for making predictions. As seen in Figure 3, some genes are used more frequently than others in making predictions. The expression of those genes can be viewed as a signature of cancer. More details on the gene selection for the cancer signature can be found here.
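
    One common way to summarize how often a gene is used is to take, for each MCMC draw, the share of splitting rules that involve the gene, and then average that share over the draws. The sketch below assumes this definition; it is not confirmed to be the exact computation behind Figure 3. The input format (a list of draws, each a list of gene names used in splits) and the gene names are hypothetical.

```python
from collections import Counter


def usage_proportions(draws):
    """Average, over MCMC draws, the share of splitting rules that use each gene.

    draws: list of draws; each draw is a list with one gene name per splitting rule.
    Draws with no splitting rules are skipped.
    """
    per_draw = []
    for split_genes in draws:
        if not split_genes:
            continue
        counts = Counter(split_genes)
        total = len(split_genes)
        per_draw.append({gene: n / total for gene, n in counts.items()})
    if not per_draw:
        return {}
    genes = sorted({gene for shares in per_draw for gene in shares})
    return {
        gene: sum(shares.get(gene, 0.0) for shares in per_draw) / len(per_draw)
        for gene in genes
    }


# Two made-up draws with hypothetical gene names.
draws = [["geneA", "geneA", "geneB"], ["geneA", "geneC"]]
usage = usage_proportions(draws)
print(usage)
```

    Under this definition the usage values of all genes sum to 1, so a gene's usage can be compared against the value expected if every gene were used equally often.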

Figure 3
    The genes that are useful for diagnosis are:

Gene name       Usage
LOC100289058    0.029768409
RPS10           0.060487284
RPS23           0.020760328
RPS17           0.036368359
RPSA            0.022904412
TMA7            0.070130458
CST6            0.033133748
PRORSD1P        0.042754138
AI732986        0.031540927
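
    For reading the table, it can help to sort the genes by usage. The short sketch below does this with the values listed above; the variable names are ours, and the sorting is purely a presentation aid.

```python
# Usage values copied from the table above.
gene_usage = {
    "LOC100289058": 0.029768409,
    "RPS10": 0.060487284,
    "RPS23": 0.020760328,
    "RPS17": 0.036368359,
    "RPSA": 0.022904412,
    "TMA7": 0.070130458,
    "CST6": 0.033133748,
    "PRORSD1P": 0.042754138,
    "AI732986": 0.031540927,
}

# Sort from most used to least used.
ranked = sorted(gene_usage.items(), key=lambda item: item[1], reverse=True)

for gene, usage_value in ranked:
    print(f"{gene:<14}{usage_value:.4f}")
```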