PDAC Prediction with a Urine Biomarker Panel

Yujie (Janet) He

METHODS

Dataset Description

Previous reports suggest that urinary creatinine, LYVE1, REG1B, and TFF1 levels are positively associated with a PDAC diagnosis, and that other factors, including age and sex, may also affect the diagnosis. The unit of analysis in this study is a single urine sample from a patient or a control individual. The data collected by Crnogorac-Jurcevic et al. originally came from several separate sources, including BPTB (Barts Pancreas Tissue Bank), LIV (Liverpool University), and ESP (Spanish National Cancer Research Centre). Samples from UCL (University College London) were removed because none of them came from pancreatic cancer patients. Data source was treated as a categorical control variable in the model. All specimens were collected before surgery or chemotherapeutic treatment and were age- and sex-matched wherever possible, as described in the paper. The 570 samples were split into two subsets: 512 were used to analyze the data and build the model, and the remaining 58 were used to test the hypotheses.
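The text does not say how the 512/58 split was drawn. As a minimal sketch, assuming a simple random split with a fixed seed (the function name split_samples and the seed are hypothetical, not taken from the study), the partition could look like this:

```python
import random


def split_samples(sample_ids, n_test, seed=0):
    """Shuffle the ids reproducibly and hold out n_test of them.

    Assumption: a plain random split; the study may have split differently.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    return ids[n_test:], ids[:n_test]  # (training ids, held-out ids)


# 570 samples in total with 58 held out, as stated in the text.
train_ids, test_ids = split_samples(range(570), n_test=58)
print(len(train_ids), len(test_ids))  # 512 58
```

Fixing the seed keeps the held-out set identical across the logit and machine learning models, so their results are comparable.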

Logit Model

Because the goal of this project is diagnosis, the dependent variable was treated as binary, which makes a logit model a natural choice. Before fitting the model, the correlation between each pair of independent variables was estimated, as shown below. Among all pairs, the REG1B and TFF1 levels had the strongest correlation, so the interaction term REG1B*TFF1 (x_i3*x_i4) was included as an extra variable when constructing the model.

Figure: Correlation among the four urine protein levels, generated with R. No pair of variables is completely independent. To simplify the model, only the pair with the highest correlation coefficient was considered.
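The pair-selection step can be sketched in Python as follows. This assumes Pearson correlation (the report does not state which coefficient R computed), and the numbers are made-up toy values, not study data. The helper names pearson and strongest_pair are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def strongest_pair(columns):
    """Return (name_a, name_b, r) for the pair with the largest |r|."""
    best = None
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if best is None or abs(r) > abs(best[2]):
            best = (a, b, r)
    return best


# Toy values for illustration only (NOT the study data).
toy = {
    "creatinine": [0.5, 1.2, 0.8, 1.9, 0.7, 1.1],
    "LYVE1": [0.1, 2.0, 0.4, 3.5, 1.0, 0.2],
    "REG1B": [10.0, 50.0, 20.0, 200.0, 35.0, 60.0],
    "TFF1": [30.0, 120.0, 70.0, 610.0, 90.0, 200.0],
}
name_a, name_b, r = strongest_pair(toy)
print(name_a, name_b, round(r, 3))

# The interaction column is the element-wise product of the selected pair.
interaction = [u * v for u, v in zip(toy[name_a], toy[name_b])]
```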

The covariates described in the Dataset Description section were represented by a matrix X, and the binary diagnoses by Y_i. To simplify the model, I assumed that all covariates and the intercept are independent, except for the urine TFF1 level (x_i3) and the REG1B level (x_i4), as indicated by the correlation analysis.
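To make the structure of X concrete, here is a minimal sketch of how one row could be assembled. The column order, the dummy coding of the data source with BPTB as the reference level, and the function name design_row are all assumptions for illustration. They do not reproduce the report's own indexing (x_i3, x_i4).

```python
# Data sources kept in the study; the first one is used as the reference level.
SOURCES = ["BPTB", "LIV", "ESP"]


def design_row(age, is_male, source, creatinine, lyve1, tff1, reg1b):
    """Build one row of X: intercept, covariates, and the TFF1*REG1B term.

    Hypothetical column order, chosen only for this illustration.
    """
    # One 0/1 indicator per non-reference data source.
    dummies = [1.0 if source == s else 0.0 for s in SOURCES[1:]]
    return [
        1.0,              # intercept
        float(age),
        float(is_male),
        *dummies,         # LIV, ESP indicators
        creatinine,
        lyve1,
        tff1,
        reg1b,
        tff1 * reg1b,     # interaction term for the correlated pair
    ]


row = design_row(60, True, "LIV", 1.0, 2.0, 3.0, 4.0)
print(row)
```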

Fitting with Maximum Likelihood Estimation (MLE)

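In a logistic regression model, the probability of an outcome Y_i given the parameter π_i is P(Y_i = y_i | π_i) = π_i^(y_i) (1 − π_i)^(1 − y_i), where y_i is 0 or 1 and π_i is linked to the covariates through logit(π_i) = log(π_i / (1 − π_i)) = x_i^T β.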

The likelihood contribution L_i(β) is the probability of observing the data from the ith sample under the parameters β, while L(β) is the probability of observing the whole dataset under β. The maximum likelihood estimate is the value of β that maximizes L(β).
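Written out in standard form, and assuming the samples are independent, the two quantities are related as below. Here x_i denotes the ith row of X, y_i the observed value of Y_i, and n the number of training samples; these symbols follow the usual textbook convention and may differ from the original notation.

```latex
\begin{aligned}
\pi_i(\beta) &= \frac{1}{1 + \exp(-x_i^{\top}\beta)}, \\
L_i(\beta) &= \pi_i(\beta)^{y_i}\,\bigl(1 - \pi_i(\beta)\bigr)^{1 - y_i}, \\
L(\beta) &= \prod_{i=1}^{n} L_i(\beta), \\
\log L(\beta) &= \sum_{i=1}^{n} \Bigl[\, y_i \log \pi_i(\beta) + (1 - y_i) \log\bigl(1 - \pi_i(\beta)\bigr) \Bigr].
\end{aligned}
```

In practice the log-likelihood is maximized rather than L(β) itself, because the sum is numerically better behaved than the product.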

MLE was done in R using the glm() function.
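R's glm() with a binomial family fits the model by iteratively reweighted least squares. As an illustration of what is being maximized, the sketch below uses plain gradient ascent on the log-likelihood instead, which is a simpler technique and not the author's procedure. The data are made-up toy values, and fit_logit, the learning rate, and the iteration count are hypothetical choices.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood of the logit model."""
    total = 0.0
    for row, yi in zip(X, y):
        p = sigmoid(sum(b * x for b, x in zip(beta, row)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total


def fit_logit(X, y, lr=0.1, n_iter=5000):
    """Gradient ascent on the mean log-likelihood, starting from beta = 0."""
    beta = [0.0] * len(X[0])
    n = len(X)
    for _ in range(n_iter):
        grad = [0.0] * len(beta)
        for row, yi in zip(X, y):
            resid = yi - sigmoid(sum(b * x for b, x in zip(beta, row)))
            for j, x in enumerate(row):
                grad[j] += resid * x  # d logL / d beta_j = sum_i (y_i - pi_i) x_ij
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


# Toy data (NOT the study data): intercept column plus one covariate.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1]
X = [[1.0, x] for x in xs]

beta_hat = fit_logit(X, y)
print(beta_hat, log_likelihood(beta_hat, X, y))
```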

Fitting with Bayesian Estimation

The prior on the parameter vector β is assumed to follow a multivariate normal distribution.

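To simplify the model, all covariates and the intercept are assumed to be independent. Thus, the prior covariance matrix is diagonal, and the prior factorizes into a product of univariate normal densities, one per coefficient.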
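In generic notation, the prior and the resulting posterior can be written as follows. The symbols μ_j and σ_j² (prior mean and variance of coefficient β_j) and the index range 0..p are assumptions for illustration; the report's own hyperparameters are not given here.

```latex
\begin{aligned}
\beta &\sim \mathcal{N}(\mu, \Sigma), \qquad
\Sigma = \operatorname{diag}\bigl(\sigma_0^2, \sigma_1^2, \ldots, \sigma_p^2\bigr), \\
p(\beta) &= \prod_{j=0}^{p} \mathcal{N}\bigl(\beta_j \mid \mu_j, \sigma_j^2\bigr), \\
p(\beta \mid X, Y) &\propto L(\beta)\, p(\beta).
\end{aligned}
```

The posterior has no closed form for the logit likelihood, which is why a sampling method is needed.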

An MCMC (Markov chain Monte Carlo) algorithm was developed to fit the model. Details can be found in Supplementary Materials.
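The report's own algorithm is described only in the Supplementary Materials. As a rough illustration of the general idea, the sketch below uses a random-walk Metropolis sampler with independent N(0, prior_sd²) priors. The sampler type, the prior scale, the step size, and the toy data are all assumptions and may differ from what was actually implemented.

```python
import math
import random


def log_posterior(beta, X, y, prior_sd=10.0):
    """Unnormalized log-posterior: logit log-likelihood plus normal log-priors."""
    lp = 0.0
    for row, yi in zip(X, y):
        z = sum(b * x for b, x in zip(beta, row))
        # log p(y_i | beta) = y_i * z - log(1 + exp(z)), computed stably.
        lp += yi * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    # Independent N(0, prior_sd^2) prior on every coefficient (assumed).
    lp -= sum(b * b for b in beta) / (2.0 * prior_sd ** 2)
    return lp


def metropolis(X, y, n_samples=3000, burn_in=1000, step=0.5, seed=0):
    """Random-walk Metropolis; returns (posterior draws, acceptance rate)."""
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    current = log_posterior(beta, X, y)
    samples, accepted = [], 0
    total = n_samples + burn_in
    for it in range(total):
        proposal = [b + rng.gauss(0.0, step) for b in beta]
        candidate = log_posterior(proposal, X, y)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            beta, current = proposal, candidate
            accepted += 1
        if it >= burn_in:
            samples.append(beta)
    return samples, accepted / total


# Toy data (NOT the study data): intercept column plus one covariate.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, -1.2, 0.8, 1.8]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1]
X = [[1.0, x] for x in xs]

samples, rate = metropolis(X, y)
slope_mean = sum(s[1] for s in samples) / len(samples)
print(round(rate, 2), round(slope_mean, 2))
```

The posterior mean of each coefficient is estimated by averaging the retained draws; the burn-in draws are discarded so that the starting point has little influence.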

Machine Learning Models

Several machine learning models were applied to predict PDAC diagnosis. This part was implemented in Python. Their performance is compared in the Results and Discussions section.

github.com/dmlc/xgboost

github.com/microsoft/LightGBM

github.com/PriorLabs/TabPFN
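All three libraries expose scikit-learn-style classifiers with fit and predict_proba, so they can be compared with one loop. The sketch below is a hedged illustration: the default hyperparameters, the use of ROC AUC as the metric, and the helper names build_models and compare are assumptions, since the settings and metric used in the study are not stated in this section.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def build_models():
    """Return whichever of the three classifiers are installed, with defaults."""
    models = {}
    try:
        from xgboost import XGBClassifier
        models["XGBoost"] = XGBClassifier()
    except ImportError:
        pass
    try:
        from lightgbm import LGBMClassifier
        models["LightGBM"] = LGBMClassifier()
    except ImportError:
        pass
    try:
        from tabpfn import TabPFNClassifier
        models["TabPFN"] = TabPFNClassifier()
    except ImportError:
        pass
    return models


def compare(models, X_train, y_train, X_test, y_test):
    """Fit each model on the training set and score it on the held-out set."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        proba = model.predict_proba(X_test)
        # Column 1 holds the predicted probability of the positive class.
        results[name] = roc_auc(y_test, [row[1] for row in proba])
    return results
```

A call such as compare(build_models(), X_train, y_train, X_test, y_test), with numeric arrays for the 512 training and 58 held-out samples, would return one AUC per installed model.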

XGBoost

LightGBM

TabPFN
