Predictive etiological classification of acute ischemic stroke through interpretable machine learning algorithms: a multicenter, prospective cohort study.

Affiliations

Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, No.119 South 4th Ring West Road, Fengtai District, Beijing, 100070, China.

China National Clinical Research Center for Neurological Diseases, No.119 South 4th Ring West Road, Fengtai District, Beijing, 100070, China.

Publication information

BMC Med Res Methodol. 2024 Sep 10;24(1):199. doi: 10.1186/s12874-024-02331-1.

DOI:10.1186/s12874-024-02331-1

PMID:39256656

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11384709/

Abstract

BACKGROUND

Prognosis, recurrence rates, and secondary prevention strategies vary significantly among the subtypes of acute ischemic stroke (AIS). Machine learning (ML) techniques can uncover intricate, non-linear relationships within medical data, enabling the identification of factors associated with etiological classification. However, research using ML algorithms to predict AIS etiology is currently lacking.

OBJECTIVE

We aimed to use interpretable ML algorithms to develop AIS etiology prediction models, identify critical factors in etiology classification, and enhance existing clinical categorization.

METHODS

This study included patients from the Third China National Stroke Registry (CNSR-III). Nine models were used to predict large artery atherosclerosis (LAA), small vessel occlusion (SVO), and cardioembolism (CE): Natural Gradient Boosting (NGBoost), Categorical Boosting (CatBoost), Extreme Gradient Boosting (XGBoost), Random Forest (RF), Light Gradient Boosting Machine (LGBM), Gradient Boosting Decision Tree (GBDT), Adaptive Boosting (AdaBoost), Support Vector Machine (SVM), and logistic regression (LR). The data were randomly split 80:20 into training and test sets. We designed an SFS-XGB procedure with 10-fold cross-validation for feature selection. The primary evaluation metrics were the area under the receiver operating characteristic curve (AUC) for discrimination and the Brier score (or calibration plots) for calibration.
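The two evaluation metrics named above can be illustrated with a minimal sketch. It assumes each subtype is scored one-vs-rest (for example, CE versus not CE), which the abstract does not state explicitly. The helper names and the toy labels and probabilities are hypothetical and are not study data.

```python
from typing import Sequence


def one_vs_rest(labels: Sequence[str], positive: str) -> list[int]:
    """Binarize subtype labels: 1 for the target subtype, 0 otherwise."""
    return [1 if lab == positive else 0 for lab in labels]


def auc_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """AUC as the probability that a random positive case receives a
    higher predicted probability than a random negative case (ties = 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome;
    lower values indicate better calibrated, sharper predictions."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical toy example (not study data): six patients and a model's
# predicted probability that each one has a cardioembolic stroke.
labels = ["LAA", "SVO", "CE", "SVO", "LAA", "CE"]
ce_prob = [0.10, 0.20, 0.80, 0.30, 0.75, 0.70]

y_ce = one_vs_rest(labels, "CE")
auc = auc_score(y_ce, ce_prob)
brier = brier_score(y_ce, ce_prob)
print(f"AUC = {auc:.3f}, Brier score = {brier:.3f}")
```

AUC measures only how well the model ranks cases, while the Brier score also penalizes probabilities that are far from the observed outcome, which is why the two are reported together for discrimination and calibration.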

RESULTS

A total of 5,213 patients were included, comprising 2,471 (47.4%) with LAA, 2,153 (41.3%) with SVO, and 589 (11.3%) with CE. In both the LAA and SVO models, the AUC values of the ML models were significantly higher than that of the LR model (P < 0.001). The optimal SVO model (RF, AUC = 0.932) outperformed the optimal LAA model (NGBoost, AUC = 0.917) and the optimal CE model (LGBM, AUC = 0.846). Each model displayed relatively satisfactory calibration. Further analysis showed that the optimal CE model identified 1,900 of the 4,156 patients (45.7%) in the stroke of undetermined etiology (SUE) group as potential CE patients.
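As a quick arithmetic check, the reported percentages can be recomputed from the counts given above. This is a minimal sketch, and the variable names are illustrative only.

```python
# Subtype counts as reported in the results.
counts = {"LAA": 2471, "SVO": 2153, "CE": 589}
total = sum(counts.values())

# Share of each subtype, as a percentage rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Share of SUE patients flagged as potential CE by the optimal CE model.
sue_flagged = round(100 * 1900 / 4156, 1)

print(total, shares, sue_flagged)
```

The recomputed total and percentages agree with the figures reported in the abstract.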

CONCLUSIONS

The ML algorithms effectively classified patients with LAA, SVO, and CE, and outperformed the LR model. The optimal ML model can identify potential CE patients among SUE patients. The newly identified predictive factors may complement the existing etiological classification system, enabling clinicians to promptly categorize stroke etiology and initiate optimal secondary prevention strategies.
