Dong Tim, Sinha Shubhra, Zhai Ben, Fudulu Daniel, Chan Jeremy, Narayan Pradeep, Judge Andy, Caputo Massimo, Dimagli Arnaldo, Benedetto Umberto, Angelini Gianni D
Bristol Heart Institute, Translational Health Sciences, University of Bristol, Bristol, United Kingdom.
School of Computing Science, Northumbria University, Newcastle upon Tyne, United Kingdom.
JMIRx Med. 2024 Jun 12;5:e45973. doi: 10.2196/45973.
The Society of Thoracic Surgeons and European System for Cardiac Operative Risk Evaluation (EuroSCORE) II risk scores are the most commonly used risk prediction models for in-hospital mortality after adult cardiac surgery. However, they are prone to miscalibration over time and poor generalization across data sets; thus, their use remains controversial. Despite increased interest in machine learning (ML), limited understanding of how data set drift affects ML performance over time remains a barrier to its wider use in clinical practice. Data set drift occurs when an ML system underperforms because of a mismatch between the data it was developed from and the data on which it is deployed.
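As a rough illustration of the mismatch described above, the following sketch compares one variable between a development sample and a deployment sample using a standardized mean difference. This is an assumed, generic measure chosen for illustration and is not necessarily how drift was quantified in this study; the function name and the toy age values are hypothetical.

```python
from statistics import mean, stdev


def standardized_mean_difference(dev, deploy):
    """Difference in means between two samples, scaled by their pooled SD.

    Values far from 0 suggest the deployment data no longer resemble
    the development data for this variable.
    """
    pooled_sd = ((stdev(dev) ** 2 + stdev(deploy) ** 2) / 2) ** 0.5
    return (mean(deploy) - mean(dev)) / pooled_sd


# Hypothetical patient ages (toy values, not study data)
dev_age = [60, 62, 65, 67, 70, 71]      # data the model was developed from
deploy_age = [66, 69, 72, 74, 77, 78]   # data the model is deployed on

smd = standardized_mean_difference(dev_age, deploy_age)
print(f"Standardized mean difference in age: {smd:.2f}")
```

A positive value here indicates that the deployment sample is older on average than the development sample; identical samples give a value of 0.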
In this study, we analyzed the extent of performance drift using models built on a large UK cardiac surgery database. The objectives were to (1) rank and assess the extent of performance drift in cardiac surgery risk ML models over time and (2) investigate any potential influence of data set drift and variable importance drift on performance drift.
We conducted a retrospective analysis of prospectively, routinely gathered data on adult patients undergoing cardiac surgery in the United Kingdom between 2012 and 2019. We temporally split the data 70:30 into a combined training and validation set and a holdout set. Five novel ML mortality prediction models were developed and assessed alongside EuroSCORE II, and we examined the relationships between and within variable importance drift, performance drift, and actual data set drift. Performance was assessed using a consensus metric.
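A temporal 70:30 split of the kind described above can be sketched as follows. This is a minimal illustration that assumes each record carries a procedure date; the function name, the `surgery_date` field, and the toy records are hypothetical and do not reflect the study's actual data schema or code.

```python
from datetime import date


def temporal_split(records, date_key="surgery_date", train_fraction=0.7):
    """Sort records chronologically, then cut once at train_fraction.

    The earliest records form the training and validation set; the most
    recent records form the holdout set, so no holdout record predates
    any training record.
    """
    ordered = sorted(records, key=lambda r: r[date_key])
    cut = round(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Ten hypothetical procedures in shuffled order (toy values, not study data)
months = [5, 1, 9, 3, 7, 2, 10, 4, 8, 6]
records = [{"id": i, "surgery_date": date(2018, m, 1)} for i, m in enumerate(months)]

train_val, holdout = temporal_split(records)
print(len(train_val), "training and validation records;", len(holdout), "holdout records")
```

Unlike a random split, cutting on time means the holdout set is evaluated under whatever drift has accumulated since the training period, which is the situation a deployed model faces.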
A total of 227,087 adults underwent cardiac surgery during the study period, with a mortality rate of 2.76% (n=6258). There was strong evidence of a decrease in overall performance across all models (P<.0001). Extreme gradient boosting (clinical effectiveness metric [CEM] 0.728, 95% CI 0.728-0.729) and random forest (CEM 0.727, 95% CI 0.727-0.728) were the overall best-performing models, both temporally and nontemporally. EuroSCORE II performed the worst across all comparisons. Sharp changes in variable importance and data set drift from October to December 2017, from June to July 2018, and from December 2018 to February 2019 mirrored the effects of performance decrease across models.
All models showed a decrease in at least 3 of the 5 individual metrics. CEM and variable importance drift detection demonstrated the limitations of the logistic regression methods used for cardiac surgery risk prediction and the effects of data set drift. Future work will be required to determine the interplay between ML models and whether ensemble models could improve on their respective performance advantages.