Zhang Yongsheng, Zhang Hongyu, Wang Dawei, Li Na, Lv Haoyue, Zhang Guang
Department of Health Management, The First Affiliated Hospital of Shandong First Medical University & Shandong Provincial Qianfoshan Hospital, Jinan, China.
Shandong Engineering Research Center of Health Management, Shandong Institute of Health Management, The First Affiliated Hospital of Shandong First Medical University & Shandong Provincial Qianfoshan Hospital, Jinan, China.
J Med Internet Res. 2025 May 9;27:e73190. doi: 10.2196/73190.
Diabetes has emerged as a critical global public health crisis. Prediabetes, a transitional phase in which 5%-10% of individuals progress to diabetes each year, offers a critical window for intervention. The lack of a 5-year risk prediction model for progression to diabetes among Chinese individuals with prediabetes limits clinical decision support.
This study aimed to develop and validate a machine learning-based model predicting the 5-year risk of progression from prediabetes to diabetes in the Chinese population, and to establish an interactive web-based platform that facilitates the identification of high-risk patients and early targeted interventions, ultimately reducing diabetes incidence and health care burdens.
A retrospective cohort study was conducted on 2 prediabetes cohorts from 2 Chinese medical centers (primary cohort: n=6578; external validation cohort: n=2333), followed from 2019 to 2024. Participants meeting the American Diabetes Association (ADA) criteria (prediabetes: hemoglobin A1c [HbA1c] level of 5.7%-6.4%; diabetes: HbA1c level of ≥6.5%) were identified. A total of 42 variables (demographics, physical measures, and hematologic biomarkers) were collected using standardized protocols. Participants in the primary cohort were randomly split into training (70%) and test (30%) sets. Significant predictors were selected on the training set using recursive feature elimination methods. Prediction models were then developed using 7 machine learning algorithms (logistic regression, random forest, support vector machine, multilayer perceptron, extreme gradient boosting machine, light gradient boosting machine, and categorical boosting machine [CatBoost]) and optimized through grid search with 5-fold cross-validation. Model performance was assessed on both the test set and the external test set using the receiver operating characteristic curve, the precision-recall curve, accuracy, sensitivity, and specificity, as well as multiple other metrics.
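The evaluation metrics named above can be illustrated with a minimal sketch in plain Python. This is not the authors' code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration. It computes accuracy, sensitivity, specificity, negative predictive value, and F1-score from a confusion matrix, and the area under the receiver operating characteristic curve as the fraction of (progressor, non-progressor) pairs that the model ranks correctly.

```python
from typing import Dict, Sequence


def _ratio(num: float, den: float) -> float:
    """Return num/den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def threshold_metrics(
    y_true: Sequence[int], y_score: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Confusion-matrix metrics at a fixed cutoff (the 0.5 default is an assumption)."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),  # recall among progressors
        "specificity": _ratio(tn, tn + fp),  # recall among non-progressors
        "npv": _ratio(tn, tn + fn),          # negative predictive value
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration: 1 = progressed to diabetes, 0 = did not.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]

if __name__ == "__main__":
    print(threshold_metrics(toy_labels, toy_scores))
    print("AUC:", roc_auc(toy_labels, toy_scores))
```

The pairwise AUC is quadratic in the number of cases, which is fine for a sketch but slow for cohorts of this size; a rank-based implementation from an established library would be the practical choice.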
During the 5-year follow-up, 2610 (41.6%) participants in the primary cohort and 760 (35.2%) participants in the external cohort progressed from prediabetes to diabetes, corresponding to mean annual progression rates of 8.34% and 7.04%, respectively. Using 14 features selected by the recursive feature elimination-logistic algorithm, the CatBoost model achieved the best performance, with an area under the receiver operating characteristic curve of 0.819 in the test set and 0.807 in the external test set. It also achieved the best accuracy, negative predictive value (NPV), and F1-score, as well as the best calibration, in both the test set and the external test set. Shapley Additive Explanations (SHAP) analysis highlighted the top 6 predictors: fasting blood glucose (FBG), high-density lipoprotein (HDL), the alanine aminotransferase to aspartate aminotransferase ratio (ALT/AST), BMI, age, and monocytes (MONO). Targeted modification of these risk factors could help reduce diabetes incidence.
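The idea behind the SHAP attribution can be sketched with exact Shapley values on a toy model. Everything below is hypothetical: the `toy_risk` function, its coefficients, and the feature values are invented for illustration and have no connection to the fitted CatBoost model. In practice, SHAP analyses of tree models typically rely on a dedicated library with faster algorithms rather than this brute-force enumeration. The sketch averages each feature's marginal contribution over all orderings in which features are switched from a baseline value to the individual's value.

```python
from itertools import permutations
from typing import Callable, Dict, Mapping


def shapley_values(
    predict: Callable[[Mapping[str, float]], float],
    x: Mapping[str, float],
    baseline: Mapping[str, float],
) -> Dict[str, float]:
    """Exact Shapley values by enumerating every feature ordering.

    Features not yet "switched on" keep their baseline value. The cost grows
    factorially with the number of features, so this suits toy examples only.
    """
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = x[name]             # switch this feature on
            updated = predict(current)
            totals[name] += updated - previous  # marginal contribution
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_risk(features: Mapping[str, float]) -> float:
    """Hypothetical risk score with made-up coefficients and one interaction."""
    return (
        0.8 * features["fbg"]
        - 0.5 * features["hdl"]
        + 0.1 * features["fbg"] * features["bmi"]
    )


# Invented values: one individual and a reference ("average") profile.
individual = {"fbg": 6.5, "hdl": 1.0, "bmi": 28.0}
reference = {"fbg": 5.5, "hdl": 1.3, "bmi": 24.0}

if __name__ == "__main__":
    for name, value in shapley_values(toy_risk, individual, reference).items():
        print(f"{name}: {value:+.3f}")
```

By construction, the attributions sum to the difference between the individual's score and the reference score, which is the property that lets per-feature contributions be read as a decomposition of a single prediction.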
We developed a 5-year risk prediction model of progression from prediabetes to diabetes for the Chinese population. The CatBoost model showed the best predictive performance and could effectively identify individuals at high risk of diabetes.