Cardiovascular Research Center, Cardiovascular Research Institute, Isfahan University of Medical Sciences, Isfahan, Iran.
Biostatistics and Epidemiology Department, School of Health, Isfahan University of Medical Sciences, Isfahan, Iran.
BMC Med Inform Decis Mak. 2023 Apr 19;23(1):72. doi: 10.1186/s12911-023-02169-5.
Cardiovascular diseases (CVD) are the leading cause of premature death worldwide. Identifying people at high risk of CVD is essential for prevention. This study uses machine learning (ML) and statistical techniques to develop classification models that predict future CVD events in a large sample of Iranians.
We used multiple prediction models and ML techniques with different capabilities to analyze a large dataset of 5432 people who were healthy at enrollment in the Isfahan Cohort Study (ICS) (1990-2017). Bayesian additive regression trees enhanced with "missingness incorporated in attributes" (BARTm) was run on the dataset with 515 variables (336 variables without missing values and the remainder with up to 90% missing values). For the other classification algorithms, variables with more than 10% missing values were excluded, and MissForest imputed the missing values of the remaining 49 variables. We used Recursive Feature Elimination (RFE) to select the most contributing variables. Random oversampling, the cut-point recommended by the precision-recall curve, and appropriate evaluation metrics were used to handle imbalance in the binary response variable.
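The two imbalance-handling steps named above can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the function names `random_oversample` and `best_cut_point` are hypothetical, and picking the cut-point that maximizes F1 along the precision-recall curve is an assumption, since the abstract does not state which criterion was used.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        # Nothing to resample from; return unchanged copies.
        return list(rows), list(labels)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def best_cut_point(scores, labels):
    """Return (threshold, F1) for the observed score that maximizes F1.

    Assumption: F1 is used as the selection criterion on the
    precision-recall curve; the source does not specify one.
    """
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        if tp == 0:
            continue  # precision and F1 are undefined or zero here
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

In practice, oversampling would be applied only to the training folds, so that duplicated rows never leak into the data used for evaluation.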
This study revealed that age, systolic blood pressure, fasting blood sugar, two-hour postprandial glucose, diabetes mellitus, history of heart disease, history of high blood pressure, and history of diabetes are the most important predictors of future CVD incidence. The main differences among the classification algorithms' results stem from the trade-off between sensitivity and specificity. The Quadratic Discriminant Analysis (QDA) algorithm achieved the highest accuracy (75.50 ± 0.08) but the lowest sensitivity (49.84 ± 0.25); in contrast, decision trees gave the lowest accuracy (51.95 ± 0.69) but the highest sensitivity (82.52 ± 1.22). BARTm.90% achieved 69.48 ± 0.28 accuracy and 54.00 ± 1.66 sensitivity without any preprocessing step.
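The trade-off between sensitivity and specificity can be made concrete with a small sketch that scores the same predictions at two cut-points. The function name `confusion_metrics` and the toy scores in the usage note are hypothetical and are not taken from the study's data.

```python
def confusion_metrics(scores, labels, threshold):
    """Compute accuracy, sensitivity and specificity at a given cut-point.

    A case is predicted positive when its score is >= threshold.
    """
    tp = fn = tn = fp = 0
    for s, y in zip(scores, labels):
        predicted = 1 if s >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total if total else float("nan"),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Toy, made-up data: 3 events among 10 people.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.2, 0.1, 0.1]
strict = confusion_metrics(toy_scores, toy_labels, threshold=0.5)
lenient = confusion_metrics(toy_scores, toy_labels, threshold=0.25)
```

On this toy data, lowering the cut-point from 0.5 to 0.25 catches every event (higher sensitivity) at the cost of more false alarms (lower specificity and accuracy), which is the same pattern that separates decision trees from QDA in the results above.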
This study confirmed that building a region-specific CVD prediction model is valuable for screening and primary prevention strategies in that region. The results also showed that using conventional statistical models alongside ML algorithms makes it possible to take advantage of both approaches. In general, QDA can predict future CVD events accurately, with a procedure that is fast (inference speed) and stable (confidence values). BARTm, which combines ML and statistical modeling, provides a flexible approach that requires no technical knowledge of the assumptions or preprocessing steps of the prediction procedure.