Lolak Sermkiat, Attia John, McKay Gareth J, Thakkinstian Ammarin
Department of Clinical Epidemiology and Biostatistics, Faculty of Medicine, Ramathibodi Hospital, Mahidol University, Bangkok, Thailand.
Centre for Clinical Epidemiology and Biostatistics, School of Medicine and Public Health, Hunter Medical Research Institute, University of Newcastle, New South Wales, Australia.
JMIR Cardio. 2023 Jul 26;7:e47736. doi: 10.2196/47736.
Stroke has multiple modifiable and nonmodifiable risk factors and represents a leading cause of death globally. Understanding the complex interplay of stroke risk factors is thus not only a scientific necessity but a critical step toward improving global health outcomes.
We aim to assess the performance of explainable machine learning models in predicting stroke and identifying its risk factors using real-world cohort data, comparing these models with conventional statistical methods.
This retrospective cohort included high-risk patients from Ramathibodi Hospital in Thailand between January 2010 and December 2020. We compared the performance and explainability of logistic regression (LR), Cox proportional hazards, Bayesian network (BN), tree-augmented Naïve Bayes (TAN), extreme gradient boosting (XGBoost), and explainable boosting machine (EBM) models. We used multiple imputation by chained equations for missing data and discretized continuous variables as needed. Models were evaluated using C-statistics and F-scores.
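As an illustration of the two evaluation metrics named above, the following is a minimal sketch in plain Python. It is not the study's code. It assumes a binary outcome (1 = stroke, 0 = no stroke), treats the C-statistic as the proportion of concordant event/non-event pairs (ties counted as one half), and assumes the F-score is the balanced F1 variant, since the abstract does not state which F-score was used. The toy labels, scores, and the 0.5 classification threshold are hypothetical.

```python
from itertools import product


def c_statistic(y_true, y_score):
    """Proportion of (event, non-event) pairs in which the event has the
    higher predicted risk; tied scores count as half a concordant pair."""
    events = [s for y, s in zip(y_true, y_score) if y == 1]
    non_events = [s for y, s in zip(y_true, y_score) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for e, n in product(events, non_events):
        if e > n:
            concordant += 1.0
        elif e == n:
            concordant += 0.5
    return concordant / (len(events) * len(non_events))


def f_score(y_true, y_pred, beta=1.0):
    """F-beta score from binary labels and predictions (beta=1 gives F1)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical toy data: six patients, two of whom had a stroke.
toy_labels = [0, 0, 0, 0, 1, 1]
toy_scores = [0.10, 0.30, 0.35, 0.60, 0.40, 0.80]

# Assumed threshold of 0.5 for turning predicted risk into a class label.
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]

toy_c = c_statistic(toy_labels, toy_scores)  # 7 of 8 pairs concordant
toy_f1 = f_score(toy_labels, toy_preds)      # precision = recall = 0.5
print(f"C-statistic: {toy_c:.3f}, F1: {toy_f1:.3f}")
```

The C-statistic depends only on how patients are ranked by predicted risk, whereas the F-score depends on the chosen threshold, which is one reason the two metrics can order models differently when events are rare.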
Of 275,247 high-risk patients, 9659 (3.5%) experienced a stroke. XGBoost demonstrated the highest performance, with a C-statistic of 0.89 and an F-score of 0.80, followed by EBM and TAN with C-statistics of 0.87 and 0.83, respectively; LR and BN had similar C-statistics of 0.80. Significant factors associated with stroke included atrial fibrillation (AF), hypertension (HT), antiplatelets, HDL, and age. AF, HT, and antihypertensive medication were common significant factors across most models, with AF being the strongest factor in the LR, XGBoost, BN, and TAN models.
Our study developed stroke prediction models to identify crucial predictive factors in high-risk patients, such as AF; HT, systolic blood pressure, or antihypertensive medication; anticoagulant medication; HDL; age; and statin use. The explainable XGBoost was the best model in predicting stroke risk, followed by EBM.