Lee Hayeon, Hwang Seung Ha, Park Seoyoung, Choi Yunjeong, Lee Sooji, Park Jaeyu, Son Yejun, Kim Hyeon Jin, Kim Soeun, Oh Jiyeon, Smith Lee, Pizzol Damiano, Rhee Sang Youl, Sang Hyunji, Lee Jinseok, Yon Dong Keon
Center for Digital Health, Medical Science Research Institute, Kyung Hee University Medical Center, Kyung Hee University College of Medicine, Seoul, South Korea.
Department of Biomedical Engineering, Kyung Hee University, Yongin, South Korea.
EClinicalMedicine. 2025 Jan 18;80:103069. doi: 10.1016/j.eclinm.2025.103069. eCollection 2025 Feb.
Type 2 diabetes mellitus (T2DM) is a significant global public health concern that has grown steadily over the past few decades. This study aimed to predict the incidence of T2DM within 5 years and the risk of mortality following its onset, using data from three independent cohorts worldwide.
We used data from three independent, large-scale, general population-based cohort studies from three countries. The Korean cohort (NHIS-NSC cohort; discovery cohort; n = 973,303), conducted between 1 January 2002 and 31 December 2013, was used for training and internal validation, whereas the Japanese cohort (JMDC cohort; validation cohort A; n = 12,143,715) and the UK cohort (UK Biobank; validation cohort B; n = 416,656) were used for external validation. We trained several machine learning (ML)-based models on 18 features to predict the incidence of T2DM within 5 years of a regular health checkup, and calculated Shapley additive explanations (SHAP) values. To assess the robustness of the ML-based prediction model, we divided the model probability into tertiles (T1 to T3) and investigated its association with the risk of mortality following the onset of T2DM.
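The tertile step described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the probabilities and outcomes are synthetic, the names `proba`, `died`, and `tertile` are hypothetical, and the crude death proportions it prints are not the adjusted hazard ratios reported in the study, which would require a survival model with covariate adjustment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predicted probabilities of incident T2DM (synthetic, not study data).
proba = rng.beta(2, 8, size=3000)

# Cut points at the 1/3 and 2/3 quantiles split the sample into three equal groups.
cuts = np.quantile(proba, [1 / 3, 2 / 3])

# np.digitize returns 0, 1, or 2 for the three intervals; shift to T1..T3.
tertile = np.digitize(proba, cuts) + 1

# Synthetic mortality indicator whose risk rises with the predicted probability.
died = rng.random(proba.size) < (0.02 + 0.2 * proba)

for t in (1, 2, 3):
    mask = tertile == t
    print(f"T{t}: n = {mask.sum()}, crude death proportion = {died[mask].mean():.3f}")
```

Grouping by tertile only defines the exposure categories; the association with mortality would then be estimated separately, for example with a Cox proportional hazards model.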
In the discovery cohort, the ensemble model using voting with logistic regression and adaptive boosting achieved a balanced accuracy of 72.6% and an area under the receiver operating characteristic curve (AUROC) of 0.792. The SHAP value analysis of the proposed model showed that age was the most important predictor of incident T2DM, followed by fasting blood glucose, hemoglobin, γ-glutamyl transferase level, and body mass index. The model probability was associated with an increased risk of mortality (T1: adjusted hazard ratio, 2.82 [95% CI, 2.01-3.94]; T2: 3.89 [2.74-5.53]; and T3: 7.73 [5.37-11.12]). Similar patterns and trends were observed in the validation cohorts (T1: 1.74 [1.49-2.03], T2: 1.97 [1.69-2.30], and T3: 3.31 [2.82-3.38] in validation cohort A; T1: 1.33 [1.03-1.71], T2: 1.54 [1.21-1.96], and T3: 1.73 [1.36-2.20] in validation cohort B).
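As a rough illustration of the kind of ensemble described above, the following sketch combines logistic regression and adaptive boosting with scikit-learn's `VotingClassifier` and reports balanced accuracy and AUROC. It is not the study's implementation: the data are synthetic (18 generated features standing in for the 18 clinical features), soft voting is an assumption because the abstract does not state the voting scheme, and all hyperparameters are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary outcome with 18 features (not study data).
X, y = make_classification(
    n_samples=2000, n_features=18, n_informative=6,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Soft voting averages the predicted probabilities of the two base models
# (assumed here; the abstract only says "voting").
ensemble = VotingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("ada", AdaBoostClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

proba = ensemble.predict_proba(X_test)[:, 1]  # probability of the positive class
pred = ensemble.predict(X_test)

bal_acc = balanced_accuracy_score(y_test, pred)
auroc = roc_auc_score(y_test, proba)
print(f"balanced accuracy = {bal_acc:.3f}, AUROC = {auroc:.3f}")
```

Balanced accuracy averages sensitivity and specificity, which makes it more informative than plain accuracy when the outcome is rare, as incident T2DM is in a general population.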
This study derived and validated an ML-based model to predict the incidence of T2DM within 5 years across three countries (South Korea, Japan, and the UK), and showed that the model probability is associated with an increased risk of mortality.
Institute of Information & Communications Technology Planning & Evaluation, South Korea.