Gilani Neda, Somi Mohammadhossein, Hamidi Farzaneh, Santaguida Pasqualina, Faramarzi Elnaz, Arabi Belaghi Reza
Department of Statistics and Epidemiology, Faculty of Health, Tabriz University of Medical Sciences, Tabriz, Iran.
Liver and Gastrointestinal Diseases Research Center, Tabriz University of Medical Sciences, Tabriz, Iran.
Health Promot Perspect. 2025 May 6;15(1):82-92. doi: 10.34172/hpp.025.43105. eCollection 2025 May.
This study aimed to identify risk factors associated with time to onset of type 2 diabetes using artificial intelligence (AI) survival models (SM) in a population cohort from East Azerbaijan, Iran.
Data from the Azar Cohort, spanning 2014 to 2020, were analyzed using random forest (RF) variable selection together with Cox regression to identify the risk factors most relevant to diabetes. We then developed prediction models using RF survival analysis. LASSO and RF variable selection were used to select the most important variables. The concordance index (C-index) was used to evaluate the predictive performance of the models.
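To make the evaluation metric concrete, the following is a minimal sketch of Harrell's C-index for right-censored data. It is not the authors' code; the function name `concordance_index`, the handling of tied follow-up times (skipped here for simplicity), and the toy data are all assumptions made for illustration.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data (illustrative sketch).

    times       -- follow-up time for each subject
    events      -- 1/True if the event was observed, 0/False if censored
    risk_scores -- model output; a higher score means a higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification (assumption): pairs with tied times are skipped
        # Let `a` be the subject with the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # earlier event also has the higher predicted risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data, not taken from the Azar Cohort.
toy_times = [2, 4, 6, 8]            # follow-up time
toy_events = [1, 1, 0, 1]           # third subject is censored
toy_scores = [0.9, 0.4, 0.7, 0.5]   # hypothetical predicted risks
print(concordance_index(toy_times, toy_events, toy_scores))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of risks, which is the scale on which C-index values for survival models are usually read.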
Our LASSO-Cox regression identified six factors significantly associated with diabetes: age, mean corpuscular hemoglobin concentration (MCHC), waist circumference (WC), body mass index (BMI), use of sleep medication, and hypertension (stage 1 and stage 2). The model including all variables had a C-index of 76.3%. In contrast, the RF analysis identified 21 important variables predicting a higher probability of developing diabetes. Of these, WC, MCHC, triglyceride, and age were the most important predictors. The RF model converged after 500 trees, with an out-of-bag (OOB) estimate of 0.28 and a C-index of 79.5%.
RF machine learning algorithms and LASSO-Cox regression analyses consistently identified WC, hypertension, and MCHC as the main risk factors for developing diabetes. The RF approach demonstrated slightly better accuracy in predicting the likelihood of diabetes at different time points.