Department of Radiation Oncology (Maastro), GROW School for Oncology and Reproduction, Maastricht University Medical Centre+, Maastricht, The Netherlands.
Clinical Data Science, Maastricht University, Maastricht, The Netherlands.
Sci Rep. 2023 Oct 24;13(1):18176. doi: 10.1038/s41598-023-45486-5.
In the past decade, there has been a sharp increase in publications describing applications of convolutional neural networks (CNNs) in medical image analysis. However, recent reviews have warned that most such studies lack reproducibility, which has impeded closer examination of the models and, in turn, their implementation in healthcare. Moreover, the performance of these models is highly dependent on choices of architecture and image pre-processing. In this work, we assess the reproducibility of three studies that use CNNs for head and neck cancer outcome prediction by attempting to reproduce the published results. In addition, we propose a new network structure and assess the impact of image pre-processing and model selection criteria on performance. We used two publicly available datasets: one with 298 patients for training and validation and another with 137 patients from a different institute for testing. All three studies failed to report elements required to fully reproduce their results, mainly the image pre-processing steps and the random seed. Our model either outperforms or achieves similar performance to the existing models with considerably fewer parameters. We also observed that the pre-processing steps significantly impact the model's performance and that some model selection criteria may lead to suboptimal models. Although there have been improvements in the reproducibility of deep learning models, our work suggests that wider implementation of reporting standards is required to avoid a reproducibility crisis.
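The two elements the abstract names as most often unreported, the random seed and the image pre-processing steps, can be recorded alongside a model with very little effort. The following is a minimal sketch using only the Python standard library, not the method used in the study. The function names `make_manifest` and `split_patients`, the pre-processing keys and values, the seed, and the 20% validation fraction are all hypothetical and chosen for illustration; only the cohort size of 298 patients comes from the abstract.

```python
import hashlib
import json
import random


def make_manifest(seed, preprocessing):
    """Return a JSON-serialisable record of the settings needed to rerun an experiment."""
    # Canonical JSON (sorted keys) so identical settings always produce the same hash.
    canonical = json.dumps(preprocessing, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {
        "seed": seed,
        "preprocessing": preprocessing,
        "preprocessing_sha256": digest,
    }


def split_patients(patient_ids, val_fraction, seed):
    """Split patient IDs into (train, validation) lists using a dedicated seeded generator."""
    rng = random.Random(seed)       # local generator, unaffected by other random calls
    shuffled = sorted(patient_ids)  # sort first so the input order does not matter
    rng.shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Illustrative values only: none of these settings are taken from the paper.
SEED = 42
PREPROCESSING = {
    "resample_spacing_mm": [1.0, 1.0, 1.0],
    "intensity_window_hu": [-500, 500],
}

manifest = make_manifest(SEED, PREPROCESSING)
train_ids, val_ids = split_patients(range(298), val_fraction=0.2, seed=SEED)

# Store this text next to the trained model so the run can be repeated later.
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

Seeding a local `random.Random` instance rather than the global generator keeps the split stable even if other code draws random numbers. A deep learning framework would additionally need its own seeds and determinism settings, which this sketch does not cover.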