Random effects during training: Implications for deep learning-based medical image segmentation.

Affiliations

Clinical Physiology, Department of Clinical Sciences Lund, Lund University, Lund, Sweden; Department of Biomedical Engineering, Faculty of Engineering, Lund University, Lund, Sweden.

Clinical Physiology, Department of Clinical Sciences Lund, Lund University, Lund, Sweden; Department of Biomedical Engineering, Faculty of Engineering, Lund University, Lund, Sweden; Wallenberg Centre for Molecular Medicine, Lund University, Lund, Sweden.

Publication information

Comput Biol Med. 2024 Sep;180:108944. doi: 10.1016/j.compbiomed.2024.108944. Epub 2024 Aug 2.

Abstract

BACKGROUND

A single learning algorithm can produce deep learning-based image segmentation models that vary in performance purely due to random effects during training. This study assessed the effect of these random performance fluctuations on the reliability of standard methods of comparing segmentation models.

METHODS

The influence of random effects during training was assessed by running a single learning algorithm (nnU-Net) with 50 different random seeds on three multiclass 3D medical image segmentation problems: brain tumour, hippocampus, and cardiac segmentation. Recent literature was sampled to find the most common methods for estimating and comparing the performance of deep learning segmentation models. Based on this, segmentation performance was assessed using both hold-out validation and 5-fold cross-validation, and the statistical significance of performance differences was measured using the paired t-test and the Wilcoxon signed-rank test on Dice scores.
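For readers unfamiliar with the metric, the following is a minimal sketch of a per-class Dice score on multiclass label maps. The abstract does not say how Dice was aggregated across classes, so averaging over foreground labels is an assumption, as are the function names and the convention of scoring 1.0 when a label is absent from both prediction and reference. nnU-Net's own evaluation code may differ.

```python
def dice_score(pred, truth, label):
    """Dice coefficient for one label over two flat label sequences."""
    pred_mask = [p == label for p in pred]
    truth_mask = [t == label for t in truth]
    intersection = sum(p and t for p, t in zip(pred_mask, truth_mask))
    total = sum(pred_mask) + sum(truth_mask)
    # Assumed convention: a label missing from both maps counts as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_foreground_dice(pred, truth, labels):
    """Unweighted mean Dice over the given foreground labels (an assumption)."""
    return sum(dice_score(pred, truth, lab) for lab in labels) / len(labels)


# Toy flattened label maps: 0 is background, 1 and 2 are foreground classes.
pred = [0, 1, 1, 2, 2, 0]
truth = [0, 1, 2, 2, 2, 0]

print(dice_score(pred, truth, 1))
print(dice_score(pred, truth, 2))
print(mean_foreground_dice(pred, truth, [1, 2]))
```

A real 3D volume would be flattened to one label per voxel before being passed to these functions, giving one Dice value per test case and per class.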

RESULTS

For the different segmentation problems, the seed producing the highest mean Dice score statistically significantly outperformed between 0% and 76% of the remaining seeds when estimating performance using hold-out validation, and between 10% and 38% when estimating performance using 5-fold cross-validation.
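The kind of comparison behind these percentages can be sketched as follows: take the seed with the highest mean Dice and test it against every other seed on the same test cases. This is only an illustration, not the study's code. It uses a Wilcoxon signed-rank test with the large-sample normal approximation (no tie or continuity correction) so that it runs on the Python standard library alone, whereas the study used both the paired t-test and the Wilcoxon test. The 0.05 threshold, the function names, and the synthetic Dice scores are all assumptions.

```python
import math
import random
from statistics import NormalDist, mean


def wilcoxon_signed_rank(x, y):
    """Return (z, two-sided p) for paired samples via the normal approximation."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


def fraction_outperformed(dice_by_seed, alpha=0.05):
    """Index of the best-mean seed and the fraction of other seeds it beats."""
    means = [mean(scores) for scores in dice_by_seed]
    best = max(range(len(means)), key=means.__getitem__)
    others = [s for s in range(len(dice_by_seed)) if s != best]
    wins = 0
    for s in others:
        z, p = wilcoxon_signed_rank(dice_by_seed[best], dice_by_seed[s])
        if p < alpha and z > 0:  # significant, and in favour of the best seed
            wins += 1
    return best, wins / len(others)


# Synthetic per-case Dice scores: every "seed" is the same algorithm, differing
# only by a small random offset plus per-case noise (illustrative values).
rng = random.Random(0)
n_seeds, n_cases = 10, 40
case_level = [rng.uniform(0.80, 0.95) for _ in range(n_cases)]
dice_by_seed = []
for _ in range(n_seeds):
    offset = rng.gauss(0, 0.004)
    dice_by_seed.append(
        [min(1.0, c + offset + rng.gauss(0, 0.005)) for c in case_level]
    )

best, frac = fraction_outperformed(dice_by_seed)
print(f"best seed {best} significantly outperforms {frac:.0%} of the others")
```

Because every simulated seed comes from the same generating process, any significant difference found here reflects random variation rather than a better algorithm, which is the situation the study examines.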

CONCLUSION

Random effects during training can cause high rates of statistically significant performance differences between segmentation models from the same learning algorithm. Whilst statistical testing is widely used in contemporary literature, our results indicate that a statistically significant difference in segmentation performance is a weak and unreliable indicator of a true performance difference between two learning algorithms.
