Optimized feature selection and advanced machine learning for stroke risk prediction in revascularized coronary artery disease patients.

Author information

Si Yong, Abdollahi Armin, Ashrafi Negin, Placencia Greg, Pishgar Elham, Alaei Kamiar, Pishgar Maryam

Affiliations

University of Southern California, Los Angeles, CA, USA.

California State Polytechnic University, Pomona, CA, USA.

Publication information

BMC Med Inform Decis Mak. 2025 Jul 24;25(1):276. doi: 10.1186/s12911-025-03116-2.

DOI:10.1186/s12911-025-03116-2

PMID:40707947

Abstract

BACKGROUND

Coronary artery disease (CAD) remains a leading cause of global mortality, with stroke constituting a significant complication among patients undergoing coronary revascularization procedures, such as percutaneous coronary intervention (PCI) or coronary artery bypass grafting (CABG). Previous research has demonstrated the successful application of machine learning (ML) in predicting various postoperative outcomes, including poor prognosis following cardiac surgery and the risk of postoperative stroke. Despite these advancements, a critical gap persists in studies quantitatively linking the risk of postoperative stroke to revascularization using ML-based approaches. This study aims to address this gap by developing and validating ML models to predict the risk of stroke in CAD patients undergoing coronary revascularization, with the ultimate goal of enhancing clinical decision-making and improving patient outcomes.

METHODS

We developed an ML framework to predict stroke risk in patients with CAD undergoing revascularization. Records for 5,757 patients were extracted from the Medical Information Mart for Intensive Care IV (MIMIC-IV) database. Feature selection combined Pearson correlation analysis, least absolute shrinkage and selection operator (LASSO), ridge regression, and elastic net. Initially, 35 features were identified based on expert opinion and a comprehensive literature review; the integrated results of the feature selection methods reduced the feature set to 14. The dataset was randomly divided into training, testing, and validation subsets with proportions of 70%, 15%, and 15%, respectively. Several ML models were evaluated, including logistic regression, XGBoost, random forest, AdaBoost, Bernoulli naive Bayes, k-nearest neighbors (KNN), and CatBoost. Model performance was assessed using the area under the receiver operating characteristic curve (AUC-ROC) and accuracy, with 95% confidence intervals (CIs) estimated from 500 bootstrap resamples.
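
The split and the bootstrap evaluation described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the function names, the fixed seeds, the percentile method for the interval, and the rule of skipping single-class resamples are all assumptions made here for the example.

```python
import random


def split_70_15_15(n_rows, seed=0):
    """Shuffle row indices and cut them into 70% / 15% / 15% parts.

    Assumption: a plain random (non-stratified) split; the abstract only
    says the data were "randomly divided".
    """
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_train = int(0.70 * n_rows)
    n_test = int(0.15 * n_rows)
    train = idx[:n_train]
    test = idx[n_train:n_train + n_test]
    val = idx[n_train + n_test:]  # remainder, so no row is dropped
    return train, test, val


def auc_roc(y_true, y_score):
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(y_true, y_score, n_boot=500, alpha=0.05, seed=0):
    """Percentile CI for AUC from n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        sample = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in sample]
        if len(set(ys)) < 2:
            continue  # a resample with one class has no defined AUC
        stats.append(auc_roc(ys, [y_score[i] for i in sample]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Calling `split_70_15_15(5757)` gives the three index lists, and `bootstrap_auc_ci(labels, scores)` returns a `(lower, upper)` pair for held-out labels and predicted scores. The pairwise AUC is quadratic in the number of cases, which is fine for a sketch but slow for large test sets; a rank-based implementation from a numerical library would be the usual choice in practice.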

RESULTS

The CatBoost model performed best, achieving an AUC of 0.8486 (95% CI: 0.8124-0.8797) on the test set and 0.8511 (95% CI: 0.8203-0.8793) on the validation set. Shapley Additive Explanations (SHAP) analysis identified the Charlson Comorbidity Index (CCI), length of stay (LOS), and treatment types as the most influential predictors. Notably, compared with the best existing literature, which reported an AUC of 0.760 on the test set, our model improved test-set AUC by about 0.09 (roughly 9 percentage points) while using a more parsimonious feature set.
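
The roughly 9-point gain over the 0.760 baseline can be read as either an absolute or a relative difference, and the two give different numbers. The short calculation below uses only the two test-set AUC values quoted above; the helper name `auc_gain` is hypothetical and introduced here for illustration.

```python
def auc_gain(new_auc, baseline_auc):
    """Return the (absolute, relative) improvement of new_auc over baseline_auc."""
    absolute = new_auc - baseline_auc
    relative = absolute / baseline_auc
    return absolute, relative


absolute, relative = auc_gain(0.8486, 0.760)
print(f"absolute gain: {absolute:.4f} AUC ({100 * absolute:.1f} percentage points)")
print(f"relative gain: {100 * relative:.1f}%")
# absolute gain: 0.0886 AUC (8.9 percentage points)
# relative gain: 11.7%
```

The absolute difference of about 0.089 is the one that matches the 9% figure; relative to the baseline, the gain is closer to 12%.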

CONCLUSION

By integrating four feature selection methods, we reduced the feature set from 35 to 14 features, resulting in a more efficient and reliable predictive model. We propose the CatBoost model for the prediction of postoperative stroke in patients with CAD undergoing coronary revascularization. With its strong predictive performance, the proposed model offers valuable insights for medical practitioners, enabling informed decision-making and the implementation of preventive measures to mitigate stroke risk.
