Nkemdirim Okere Arinze, Li Tianfeng, Theran Carlos, Nyasani Eunice, Ali Askal Ayalew
College of Pharmacy, The University of Iowa, 180 South Grand Ave, 366B College of Pharmacy Building (CPB), Iowa City, IA, 52242, USA.
Economic, Social, and Administrative Pharmacy (ESAP), College of Pharmacy and Pharmaceutical Sciences, Institute of Public Health, Florida A&M University, Tallahassee, FL, 32307, USA.
Comput Biol Med. 2025 Mar;187:109824. doi: 10.1016/j.compbiomed.2025.109824. Epub 2025 Feb 11.
Over one-third of the population in the United States (US) has prediabetes. Underserved populations in the US face a higher burden of prediabetes than urban populations, which increases their risk of stroke and heart disease. Little is known about early predictors of diabetes among patients with prediabetes living in underserved US communities. This study therefore aims to identify factors influencing the transition from prediabetes to diabetes in rural or underserved communities using a machine learning approach.
We conducted a retrospective analysis of data from prediabetic patients between 2012 and 2022. Eligible participants were at least 18 years old with baseline HbA1c levels between 5.7% and 6.4%. Eleven machine learning algorithms were evaluated using ten-fold cross-validation: Logistic Regression (LR), Support Vector Classifier (SVC), K-Nearest Neighbors (KNN), Gaussian Naive Bayes (GaussianNB), Bernoulli Naive Bayes (BernoulliNB), Adaptive Boosting (AdaBoost), Decision Tree (DT), Random Forest (RF), Gradient Boosting (GB), Extreme Gradient Boosting (XGBoost), and Extra Trees (ET). Subsequently, the SHAP framework was used to assess the influence of predictors and their interactions in the top-performing model.
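To make the ten-fold cross-validation step concrete, the following is a minimal sketch of how a cohort can be divided into ten folds with roughly equal class proportions. It is an illustration only: the abstract does not say whether the folds were stratified, and the helper name `stratified_k_folds`, the seed, and the synthetic label vector are assumptions. Only the cohort sizes (1910 eligible patients, 426 of whom progressed) come from the abstract.

```python
import random


def stratified_k_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with near-equal class proportions.

    Hypothetical helper for illustration; not the study's actual code.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Shuffle each class, then deal its indices round-robin across the folds.
    # The offset continues the deal where the previous class stopped, which
    # keeps the overall fold sizes balanced.
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[(pos + offset) % k].append(idx)
        offset += len(indices)
    return folds


# Synthetic labels built from the reported counts: 1 = progressed to diabetes.
labels = [1] * 426 + [0] * (1910 - 426)
folds = stratified_k_folds(labels, k=10)

for i, test_idx in enumerate(folds):
    held_out = set(test_idx)
    train_idx = [j for j in range(len(labels)) if j not in held_out]
    # A real pipeline would fit each model on train_idx and score it on test_idx.
    progressed = sum(labels[j] for j in test_idx)
    print(f"fold {i}: train={len(train_idx)} test={len(test_idx)} progressed={progressed}")
```

Each patient appears in exactly one held-out fold, so every model is scored on data it was not fitted on in that round, and the ten scores are then averaged.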
Out of 5816 patients, 1910 met the criteria, with 426 progressing to diabetes. The Random Forest model achieved the highest accuracy (90.0%) and AUC (0.963), followed by Extra Trees (89.5% accuracy, AUC 0.962) and XGBoost (88.6% accuracy, AUC 0.952). Logistic Regression demonstrated lower performance but outperformed other models such as K-Nearest Neighbors and Gaussian Naive Bayes. SHAP analysis with the RF model identified key predictors and their interactions. A significant interaction showed that lower BMI values, combined with increasing age, were associated with a reduced risk of diabetes progression, while higher BMI at younger ages increased the likelihood of progression. Additionally, several social determinants of health were identified as significant predictors.
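As context for the reported accuracies, the class balance implied by these counts can be worked out directly. The sketch below is plain arithmetic on the two numbers given in the abstract; the majority-class baseline it computes is not a result reported by the study, and the variable names are illustrative.

```python
total = 1910        # eligible patients (from the abstract)
progressed = 426    # patients who progressed to diabetes (from the abstract)

progression_rate = progressed / total

# Accuracy of a trivial classifier that predicts "no progression" for everyone.
majority_baseline_accuracy = (total - progressed) / total

print(f"progression rate: {progression_rate:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

About 22.3% of the cohort progressed, so always predicting "no progression" would already be about 77.7% accurate. This is why AUC, which does not depend on a single decision threshold, is a useful companion to accuracy on an imbalanced cohort like this one.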
Among the 11 models, the Random Forest model showed the strongest performance in predicting diabetes progression. These results can inform public policy on early, targeted interventions focusing on social determinants of health, dietary counseling, and BMI management to prevent diabetes in underserved communities.