Department of Endocrinology, People's Hospital of Wanning, Wanning, Hainan Province, China.
Department of Industrial Design, Hubei University of Technology, Wuhan, Hubei Province, China.
BMJ Open. 2023 May 30;13(5):e072991. doi: 10.1136/bmjopen-2023-072991.
The prevalence of diabetes has increased globally, leading to a significant disease burden and financial cost. Early prediction is crucial to control its prevalence.
A prospective cohort study.
Nationally representative study of the Irish population.
8504 individuals aged 50 years or older were included.
Surveys were conducted to collect over 40 000 variables related to social, financial, health, mental and family status. Feature selection was performed using logistic regression. Different machine/deep learning algorithms were trained, including distributed random forest, extremely randomised trees, a generalised linear model with regularisation, a gradient boosting machine and a deep neural network. These algorithms were integrated into a stacked ensemble to generate the best model. The model was tested using various metrics, such as the area under the curve (AUC), log loss, mean per classification error, mean square error (MSE) and root MSE (RMSE). The SHapley Additive exPlanations (SHAP) method was used to interpret the established model.
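As a rough illustration of the evaluation metrics named above, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not state which software was used. The toy labels and probabilities, the function names and the 0.5 classification threshold used for the per-class error are all assumptions made for this example.

```python
import math

def auc(y_true, y_prob):
    """AUC as the chance that a random positive outranks a random negative (ties count half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def mean_per_class_error(y_true, y_prob, threshold=0.5):
    """Average of the misclassification rates computed separately for each class.

    The 0.5 threshold is an assumption for this sketch.
    """
    errors = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        wrong = sum(1 for i in idx if int(y_prob[i] >= threshold) != cls)
        errors.append(wrong / len(idx))
    return sum(errors) / len(errors)

def mse(y_true, y_prob):
    """Mean squared difference between the 0/1 label and the predicted probability."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def rmse(y_true, y_prob):
    """Square root of the MSE."""
    return math.sqrt(mse(y_true, y_prob))

# Hypothetical toy data: 1 = developed diabetes, 0 = did not.
y_true = [0, 0, 0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]

metrics = {
    "auc": auc(y_true, y_prob),
    "log_loss": log_loss(y_true, y_prob),
    "mean_per_class_error": mean_per_class_error(y_true, y_prob),
    "mse": mse(y_true, y_prob),
    "rmse": rmse(y_true, y_prob),
}
```

Because the mean per-class error averages the error rates of the two classes rather than pooling all cases, it is not dominated by the majority class, which matters when the outcome is as uncommon as incident diabetes.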
After 2 years, 105 baseline features were identified as major contributors to diabetes risk, including sex, low-density lipoprotein cholesterol and cirrhosis. The best model achieved high accuracy, robustness and discrimination in predicting diabetes risk, with an AUC of 0.854, log loss of 0.187, mean per classification error of 0.267, RMSE of 0.229 and MSE of 0.052 in the independent test set. The model was also shown to be well calibrated. The SHAP algorithm provided insights into the decision-making process of the model.
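The reported RMSE and MSE can be checked against each other, since RMSE is by definition the square root of MSE. The short sketch below uses only the two figures quoted above; the variable names and the rounding tolerance of half a unit in the third decimal place are assumptions for illustration.

```python
# Values as reported for the independent test set.
rmse_reported = 0.229
mse_reported = 0.052

# Squaring the RMSE should recover the MSE up to rounding.
mse_from_rmse = rmse_reported ** 2

# Both figures are given to three decimals, so allow half a unit in the third place.
consistent = abs(mse_from_rmse - mse_reported) < 0.0005
```

Squaring 0.229 gives approximately 0.0524, which rounds to the reported MSE of 0.052, so the two figures agree.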
These findings could help physicians identify high-risk patients early and implement targeted interventions to reduce diabetes incidence.