Smith Hayley, Sweeting Michael, Morris Tim, Crowther Michael J
Department of Health Sciences, University of Leicester, Leicester, LE1 7RH, UK.
Statistical Innovation, Oncology Biometrics, Oncology R&D, AstraZeneca, Cambridge, UK.
Diagn Progn Res. 2022 Jun 2;6(1):10. doi: 10.1186/s41512-022-00124-y.
There is substantial interest in the adaptation and application of so-called machine learning approaches to prognostic modelling of censored time-to-event data. These methods must be compared and evaluated against existing methods in a variety of scenarios to determine their predictive performance. A scoping review of how machine learning methods have been compared to traditional survival models is important to identify the comparisons that have been made, and to flag where those comparisons are lacking, biased towards one approach, or misleading.
We conducted a scoping review of research articles published between 1 January 2000 and 2 December 2020 using PubMed. Eligible articles were those that used simulation studies to compare statistical and machine learning methods for risk prediction with a time-to-event outcome in a medical/healthcare setting. We focused on the data-generating mechanisms (DGMs), the methods that were compared, the estimands of the simulation studies, and the performance measures used to evaluate them.
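For illustration only (the specific DGMs varied across the reviewed articles and are not reproduced here), a minimal survival-data DGM of the kind used in such simulation studies might generate event times from an exponential proportional-hazards model with independent censoring:

```python
import numpy as np

def simulate_ph_data(n, beta, baseline_rate=1.0, censor_scale=2.0, seed=0):
    """Illustrative DGM: exponential proportional-hazards event times
    with independent exponential censoring. Parameter names and values
    are hypothetical, not taken from any reviewed article."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, len(beta)))                    # covariates
    rate = baseline_rate * np.exp(X @ np.asarray(beta))    # subject-specific hazard
    event_time = rng.exponential(1.0 / rate)               # latent event times
    censor_time = rng.exponential(censor_scale, size=n)    # independent censoring times
    time = np.minimum(event_time, censor_time)             # observed follow-up time
    event = (event_time <= censor_time).astype(int)        # 1 = event observed, 0 = censored
    return X, time, event

X, time, event = simulate_ph_data(500, beta=[0.5, -0.3])
```

In a method comparison, data simulated from such a DGM would be fitted with both the statistical and the machine learning candidate models, and a chosen performance measure computed over many simulation repetitions.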
A total of ten articles were identified as eligible for the review. Six of the articles evaluated a method developed by the authors themselves (four of these were machine learning methods), and the results almost always stated that the developed method's performance was equivalent to or better than that of the other methods compared. Comparisons were often biased towards the novel approach, with the majority comparing only against a basic Cox proportional hazards model, and in scenarios where it was clear that model would not perform well. In many of the articles reviewed, key information was unclear, such as the number of simulation repetitions and how the performance measures were calculated.
It is vital that method comparisons are unbiased and comprehensive, and this should be the goal even if realising it is difficult. Fully assessing how newly developed methods perform and how they compare to a variety of traditional statistical methods for prognostic modelling is imperative as these methods are already being applied in clinical contexts. Evaluations of the performance and usefulness of recently developed methods for risk prediction should be continued and reporting standards improved as these methods become increasingly popular.