Mudanjiang Medical University, Mudanjiang, China.
Shihezi University, Shihezi, China.
Sci Rep. 2024 Nov 1;14(1):26355. doi: 10.1038/s41598-024-75898-w.
This study aimed to construct a high-performance prediction and diagnosis model for diabetic retinopathy (DR) in type 2 diabetes and to identify key correlates of DR. It used a cross-sectional dataset of 3,000 patients from the People's Liberation Army General Hospital collected in 2021. Logistic regression served as the baseline against which the machine learning models were compared, both in prediction performance and in the related factors identified. Features were selected with recursive feature elimination with cross-validation (RFECV). Four machine learning models were developed to predict DR: support vector machine (SVM), decision tree (DT), random forest (RF), and gradient boosting decision tree (GBDT). Hyperparameters were tuned by grid search, and the best-performing model was selected. SHapley Additive exPlanations (SHAP) were used to analyze the important correlates of DR. Among the four models, GBDT performed best, with accuracy, precision, recall, F1-measure, and AUC of 0.7883, 0.8299, 0.7539, 0.7901, and 0.8672, respectively. Six key correlates of DR were identified: rapid micronutrient protein/creatinine measurement, 24-h micronutrient protein, fasting C-peptide, glycosylated hemoglobin, blood urea, and creatinine. The logistic model, by comparison, had 27 risk factors and an AUC of 0.8341. The resulting model thus combined superior prediction performance with a small set of easily explainable key factors: it relied on far fewer correlates than the traditional statistical method (6 versus 27) while achieving a higher AUC (0.8672 versus 0.8341).
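The threshold-based metrics reported above (accuracy, precision, recall, F1-measure) all derive from a binary confusion matrix. The following is a minimal Python sketch of those definitions, not the authors' code. The names `ConfusionCounts`, `classification_metrics`, and `f1_from` are hypothetical, and the example counts are invented for illustration because the abstract does not report a confusion matrix. Only the precision and recall values passed to the consistency check come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (hypothetical container, not from the paper)."""
    tp: int  # DR cases predicted as DR
    fp: int  # non-DR cases predicted as DR
    fn: int  # DR cases predicted as non-DR
    tn: int  # non-DR cases predicted as non-DR


def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(c: ConfusionCounts) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = c.tp + c.fp + c.fn + c.tn
    precision = c.tp / (c.tp + c.fp)
    recall = c.tp / (c.tp + c.fn)
    return {
        "accuracy": (c.tp + c.tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Consistency check using the GBDT precision and recall reported in the abstract.
reported_f1 = f1_from(0.8299, 0.7539)
print(f"F1 implied by reported precision and recall: {reported_f1:.4f}")

# Illustrative counts only (assumed, not taken from the study).
example = classification_metrics(ConfusionCounts(tp=80, fp=20, fn=20, tn=80))
print(example)
```

The F1 value implied by the reported precision and recall rounds to 0.7901, which matches the F1-measure given for GBDT. AUC is not computed here because it requires the model's predicted scores rather than thresholded counts.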