PDAC Prediction with a Urine Biomarker Panel
Yujie (Janet) He
RESULTS
The summary of the logit model fitted by maximum likelihood estimation (MLE) is shown in the table on the right. In addition to the listed values, the following statistics gauge the quality of the fit:
Null deviance = 662.74 on 511 DoF
Residual deviance = 361.58 on 501 DoF
AIC = 383.58
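The three statistics above can be cross-checked against each other. The following is a minimal sketch, assuming the model was fitted to ungrouped 0/1 outcomes (so that the residual deviance equals -2 times the maximized log-likelihood) and that the number of estimated parameters is the DoF difference plus one intercept. The variable names are illustrative and not taken from the original analysis.

```python
# Values reported in the fitting summary
null_deviance, null_dof = 662.74, 511
residual_deviance, residual_dof = 361.58, 501

# Each predictor term uses one degree of freedom; the intercept adds one more parameter
n_predictors = null_dof - residual_dof      # 10 predictor terms
n_parameters = n_predictors + 1             # 11 parameters including the intercept

# For binary outcomes: AIC = -2 * logLik + 2k = residual deviance + 2k
aic = residual_deviance + 2 * n_parameters

# Likelihood-ratio statistic of the fitted model against the intercept-only model
lr_statistic = null_deviance - residual_deviance

print(round(aic, 2))           # 383.58
print(round(lr_statistic, 2))  # 301.16
```

Under these assumptions the recomputed AIC matches the reported value, and the deviance drops by 301.16 for 10 degrees of freedom.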
Logit model fitting summary with MLE
The marginal effects of creatine, LYVE1, REG1B and TFF1 levels were analyzed separately for the different categories. Urine creatine level is inversely related to the likelihood of pancreatic cancer, and the marginal effect is stronger at lower creatine levels. In contrast, the probability of a positive diagnosis increases with both REG1B and LYVE1 levels, although the REG1B estimate is less reliable because of its larger uncertainty. The marginal effect of the urine TFF1 level carries a very large uncertainty, so TFF1 may not have a strong marginal effect in the diagnosis model.
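In a logit model the marginal effect of a covariate is not constant: it equals the coefficient multiplied by p(1 - p), where p is the predicted probability, so it changes with the covariate level. The sketch below illustrates this with a single covariate. The intercept and slope are hypothetical values chosen only for illustration and are not the fitted coefficients.

```python
import math

def predicted_probability(intercept, slope, x):
    """Logistic response for a single covariate x."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

def marginal_effect(intercept, slope, x):
    """Derivative of the predicted probability with respect to x: slope * p * (1 - p)."""
    p = predicted_probability(intercept, slope, x)
    return slope * p * (1.0 - p)

# Hypothetical coefficients: a negative slope mimics an inverse relationship
intercept, slope = 0.0, -1.0

# Marginal effect evaluated at several (scaled) covariate levels
effects = {x: marginal_effect(intercept, slope, x) for x in (0.0, 1.0, 2.0, 3.0)}
for level, effect in effects.items():
    print(f"x = {level:.1f}: marginal effect = {effect:.4f}")
```

With these illustrative values the effect is largest in magnitude at the lowest level and shrinks as the level increases, which is the kind of pattern described for creatine above.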
Marginal effects of covariates
Despite the potential relationship between the covariates TFF1 level and REG1B level, further analysis of their conditional internal interaction suggests that the relationship is negligibly weak: the line is relatively flat and the effect size is very small, as shown in the figures on the right.
Conditional internal interaction between TFF1 and REG1B levels
The logit model was also fitted by the Bayesian method. To simplify the estimation, the effects of the internal interaction and of the categorical control variables, including sample origins and patient cohorts, were not taken into account when fitting the model. All covariates were scaled before fitting. The results are summarized below.
The trace plots for the coefficients β are shown in the figures below and suggest successful convergence of the MCMC algorithm. An acceptance rate of approximately 43.1% suggests that the algorithm is well tuned and explores the posterior distribution efficiently.
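To make the acceptance rate concrete, the following is a minimal random-walk Metropolis sketch for a logit model. It is an assumption for illustration only: the poster does not state which MCMC sampler, priors, or proposal scale were used, and the data here are synthetic. The function names, the normal prior with standard deviation 10, and the step size are all hypothetical choices.

```python
import math
import random

def log_posterior(beta, X, y, prior_sd=10.0):
    """Log posterior (up to a constant) with independent N(0, prior_sd^2) priors."""
    lp = -sum(b * b for b in beta) / (2.0 * prior_sd ** 2)
    for xi, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        # Bernoulli log-likelihood y*eta - log(1 + exp(eta)), computed stably
        lp += yi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return lp

def random_walk_metropolis(X, y, n_iter=2000, step=0.3, seed=0):
    """Return the sampled chain and the fraction of accepted proposals."""
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    current = log_posterior(beta, X, y)
    chain, accepted = [], 0
    for _ in range(n_iter):
        # Symmetric Gaussian proposal around the current state
        proposal = [b + rng.gauss(0.0, step) for b in beta]
        candidate = log_posterior(proposal, X, y)
        # Accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            beta, current = proposal, candidate
            accepted += 1
        chain.append(list(beta))
    return chain, accepted / n_iter

# Synthetic data: intercept plus one scaled covariate (not the study data)
data_rng = random.Random(1)
X, y = [], []
for _ in range(200):
    x = data_rng.gauss(0.0, 1.0)
    p = 1.0 / (1.0 + math.exp(-(-0.5 + 1.0 * x)))
    X.append([1.0, x])
    y.append(1 if data_rng.random() < p else 0)

chain, acceptance_rate = random_walk_metropolis(X, y)
print(f"acceptance rate: {acceptance_rate:.3f}")
```

In a sampler of this kind, the acceptance rate is tuned through the proposal step size: larger steps lower it and smaller steps raise it. Plotting each column of `chain` against the iteration index gives a trace plot of the corresponding coefficient.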
Trace plots for coefficients β
The testing dataset, which includes 60 individual urine samples, was used to evaluate the accuracy of the fitted logit models. The confusion matrices and ROC curves are shown below.
Collectively, the parameters estimated by MLE gave a higher prediction accuracy than those from the Bayesian method in this example: the AUC value is higher, and both the type I and type II error rates are lower. This difference may arise from the omission of the categorical control variables or from the simplification of the possible internal interactions in the Bayesian fit. The former likely plays the more important role, since the internal interaction between TFF1 level and REG1B level is negligible according to the previous analysis.
Confusion matrices of the logit models fitted with different methods
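The quantities used in this comparison can be computed from true labels and predicted probabilities as sketched below. The labels and scores are a small hypothetical example, not the study's test set, and the 0.5 classification threshold is an assumption.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tp, fp, tn, fn) for predictions thresholded at `threshold`."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def auc(y_true, y_prob):
    """AUC as the probability that a positive case outscores a negative one (ties count 0.5)."""
    positives = [p for t, p in zip(y_true, y_prob) if t == 1]
    negatives = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

tp, fp, tn, fn = confusion_counts(y_true, y_prob)
type_i_error = fp / (fp + tn)    # false positive rate
type_ii_error = fn / (fn + tp)   # false negative rate
accuracy = (tp + tn) / len(y_true)
auc_value = auc(y_true, y_prob)

print(tp, fp, tn, fn)                  # 3 1 3 1
print(type_i_error, type_ii_error)     # 0.25 0.25
print(accuracy, auc_value)             # 0.75 0.875
```

The error rates depend on the chosen threshold, whereas the AUC summarizes the ranking of cases across all thresholds, which is why both are reported when comparing the two fits.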
ROC curves of the logit models fitted with different methods
The logit model fitted with the Bayesian method also has a higher AIC value. Because a lower AIC indicates a better balance between goodness of fit and model complexity, this comparison likewise favors the MLE fit over the Bayesian one.
Three machine learning models, XGBoost, LightGBM and TabPFN, were used to predict the diagnosis. Among them, XGBoost achieved the highest accuracy of 98.3%, outperforming the other two models as well as the traditional statistical models discussed above.
The ROC curves on the right indicate that all three machine learning models exhibit similar AUC values, suggesting comparable abilities to separate positive from negative cases across classification thresholds.
ROC curves of the XGBoost, TabPFN and LightGBM models
Confusion matrices of the XGBoost, LightGBM and TabPFN models based on machine learning