Department of Biostatistics, School of Allied Medical Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
Food Technology Research Institute, Faculty of Nutrition Sciences and Food Technology, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
Sci Rep. 2024 Sep 27;14(1):22230. doi: 10.1038/s41598-024-72819-9.
Cardiovascular disease (CVD) can lead to serious consequences such as death or disability. This study aimed to identify the tree-based machine learning method with the best performance for detecting CVD. Data from 9,499 participants were analyzed across 38 variables. The target variable was the presence of CVD, and village was treated as the cluster variable. A standard decision tree, a random forest, a Generalized Linear Mixed Model tree (GLMM tree), and a Generalized Mixed Effect random forest (GMERF) were fitted to the data, and their estimated predictive performance indices were compared to identify the best approach. Variable importance analysis across all models identified five variables (age, LDL, history of cardiac disease in first-degree relatives, physical activity level, and presence of hypertension) as the most influential in predicting CVD. The decision tree, random forest, GLMM tree, and GMERF yielded areas under the ROC curve of 0.56, 0.73, 0.78, and 0.80, respectively. Based on the evaluation criteria, GMERF showed the best predictive performance among the fitted models. Given the clustered structure of the data, machine learning approaches that account for this clustering may yield more accurate predictive indices and better-targeted prevention frameworks.
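The area under the ROC curve used to compare the four models can be read as the probability that a randomly chosen CVD case receives a higher predicted score than a randomly chosen non-case, with 0.5 corresponding to chance. The following is a minimal sketch of that calculation in Python, not the authors' code: the function name `roc_auc`, the labels, and the two sets of scores are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (case, non-case) pairs in which the case
    gets the higher score; tied pairs count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (1 = CVD present, 0 = absent), not from the study.
y_true = [1, 0, 1, 0, 1, 0]
model_a = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1]  # every case outranks every non-case
model_b = [0.5, 0.6, 0.7, 0.4, 0.3, 0.2]  # some non-cases outrank cases

auc_a = roc_auc(y_true, model_a)
auc_b = roc_auc(y_true, model_b)
print(f"model A AUC = {auc_a:.2f}, model B AUC = {auc_b:.2f}")
```

The pairwise loop is quadratic in the number of participants, which is fine for illustration; for a data set of this study's size a rank-based implementation from a statistics library would be the more practical choice.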