Abstract
Machine learning (ML) models are widely used in big data analysis. In particular, boosting algorithms have shown the potential to achieve predictive performance comparable to that of traditional statistical models through iterative learning. However, a notable limitation of ML models is that their performance can degrade when the sample size is small or the explanatory variables are insufficient. In this paper, we investigate how the predictive performance of ML models varies with the sample size and the number of explanatory variables within the framework of a generalized linear model. The simulation study demonstrates that the performance of ML models improves as both the sample size and the number of explanatory variables increase. The same pattern was observed in the analysis of real-world datasets.
Abstract
Although osteoporosis and osteoarthritis have different symptoms, both are highly prevalent among adult women. However, obtaining a cost-effective and consistent diagnosis for bone-related diseases is challenging, because MRI and CT examinations for osteoporosis are expensive and the measurement instruments for osteoarthritis provide divergent criteria for bone density. In this study, we propose predictive models for diagnosing osteoarthritis and osteoporosis in women, based on artificial intelligence algorithms applied to health survey data. Three algorithms are considered: Logistic Regression, Random Forest Classifier, and eXtreme Gradient Boosting machine. Because the health survey data are imbalanced, an under-sampling technique was applied to improve model performance. In addition, various feature sets were selected to reduce the dimensionality of the independent variables. The prediction model based on the eXtreme Gradient Boosting machine algorithm, trained on the under-sampled dataset, exhibited the best performance.
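As an illustration of the under-sampling step mentioned in this abstract, the following is a minimal sketch of random under-sampling, assuming the simplest variant in which majority-class records are dropped at random until every class matches the size of the smallest class. The abstract does not specify which under-sampling method was used, so the function name `random_undersample`, the toy data, and the 10/90 class split are all hypothetical.

```python
import random
from collections import defaultdict


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows so that every class has as many rows as the smallest class.

    Hypothetical helper for illustration only; not the method of the study.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Every class is reduced to the size of the smallest class.
    target = min(len(indices) for indices in by_class.values())

    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))
    kept.sort()  # preserve the original row order

    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy imbalanced data (assumed): 10 positive and 90 negative records.
labels = [1] * 10 + [0] * 90
rows = [[i, i % 7] for i in range(100)]

X_bal, y_bal = random_undersample(rows, labels)
n_pos = sum(y_bal)
n_neg = len(y_bal) - n_pos
print(n_pos, n_neg)  # prints "10 10"
```

Under this variant all minority-class records are retained and only majority-class records are discarded, so the balanced set is smaller than the original data. A classifier would then be trained on `X_bal` and `y_bal` rather than on the full imbalanced set.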