Shamsutdinova Diana, Stamate Daniel, Stahl Daniel
Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, United Kingdom.
Data Science and Soft Computing Lab, Computing Department, Goldsmiths, University of London, London, United Kingdom; School of Health Sciences, University of Manchester, Manchester, United Kingdom.
Int J Med Inform. 2025 Feb;194:105700. doi: 10.1016/j.ijmedinf.2024.105700. Epub 2024 Nov 10.
Accurate and interpretable models are essential for clinical decision-making, where predictions can directly impact patient care. Machine learning (ML) survival methods can handle complex multidimensional data and achieve high accuracy, but require post-hoc explanations. Traditional models such as the Cox Proportional Hazards model (Cox-PH) are less flexible, but fast, stable, and intrinsically transparent. Moreover, ML does not always outperform Cox-PH in clinical settings, warranting diligent model validation. We aimed to develop a set of R functions to help explore the limits of Cox-PH compared to tree-based and deep-learning survival models for clinical prediction modelling, employing ensemble learning and nested cross-validation.
We developed a set of R functions, publicly available as the package "survcompare". It supports Cox-PH and Cox-Lasso as baseline models, Survival Random Forest (SRF) and DeepHit as the ML alternatives, and ensemble methods that integrate Cox-PH with SRF or DeepHit, designed to isolate the marginal value of ML. The package performs repeated nested cross-validation and tests the statistical significance of the ML's superiority using survival-specific performance metrics: the concordance index, time-dependent AUC-ROC, and calibration slope. To gain practical insights, we applied this methodology to clinical and simulated datasets of varying complexity and size.
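The comparison workflow described above — repeatedly cross-validating a baseline and an ML survival model, scoring both with the concordance index, and testing whether the ML gain is statistically significant — can be illustrated with a minimal, dependency-free sketch. This is not the survcompare package's actual code (which is in R and also computes time-dependent AUC-ROC and calibration slope); the stand-in "models" here are simply functions mapping a covariate to a risk score, and the significance test uses a normal approximation to the one-sided paired t-test:

```python
import math
import random
import statistics

def concordance_index(times, events, risks):
    """Harrell's C-index: the fraction of comparable pairs in which the
    subject with the earlier observed event has the higher predicted risk.
    Ties in predicted risk count as half-concordant."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair (i, j) is comparable if subject i has an event
            # strictly before subject j's observed time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")

def repeated_cv_compare(data, fit_base, fit_ml, repeats=5, folds=3, seed=1):
    """Repeated k-fold CV: fit both risk scorers on the training folds,
    score the held-out fold, and collect per-fold C-index differences
    (ML minus baseline). Each data point is a (covariate, time, event)
    tuple; each fit_* takes a training list and returns a risk function."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(repeats):
        idx = list(range(len(data)))
        rng.shuffle(idx)
        for f in range(folds):
            test_ids = set(idx[f::folds])
            train = [d for i, d in enumerate(data) if i not in test_ids]
            test = [d for i, d in enumerate(data) if i in test_ids]
            times = [t for _, t, _ in test]
            events = [e for _, _, e in test]
            risk_base = fit_base(train)
            risk_ml = fit_ml(train)
            c_base = concordance_index(times, events,
                                       [risk_base(x) for x, _, _ in test])
            c_ml = concordance_index(times, events,
                                     [risk_ml(x) for x, _, _ in test])
            diffs.append(c_ml - c_base)  # may be nan if no comparable pairs
    return diffs

def paired_t_pvalue_upper(diffs):
    """One-sided test of H1: mean C-index difference > 0, using a normal
    approximation to the paired t statistic for simplicity."""
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return 0.5 * math.erfc(t / math.sqrt(2))
```

In practice one would plug in real survival learners (e.g. a fitted Cox model and an SRF) for `fit_base` and `fit_ml`, add an inner CV loop inside each `fit_*` for hyperparameter tuning (making the scheme nested, as survcompare does), and correct the paired test for the optimism of repeated CV.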
In simulated data with non-linearities or interactions, ML models outperformed Cox-PH at sample sizes ≥ 500. ML superiority was also observed in imaging and high-dimensional clinical data. However, for tabular clinical data, the performance gains of ML were minimal; in some cases, regularised Cox-Lasso recovered much of the ML's performance advantage with significantly faster computations. Ensemble methods combining Cox-PH and ML predictions were instrumental in quantifying Cox-PH's limits and improving ML calibration. Traditional models like Cox-PH or Cox-Lasso should not be overlooked when developing clinical predictive models from tabular data or data of limited size.
Our package offers researchers a framework and practical tool for evaluating the accuracy-interpretability trade-off, helping make informed decisions about model selection.