Department of Pathology, Radboud Institute for Health Sciences, Radboud University Medical Center, Nijmegen, The Netherlands.
Inify Laboratories AB, Stockholm, Sweden.
Med Image Anal. 2023 Jan;83:102655. doi: 10.1016/j.media.2022.102655. Epub 2022 Oct 17.
Machine learning model deployment in clinical practice demands real-time risk assessment to identify situations in which the model is uncertain. Once deployed, models should be accurate for classes seen during training while providing informative estimates of uncertainty to flag abnormalities and unseen classes for further analysis. Although recent developments in uncertainty estimation have resulted in an increasing number of methods, a rigorous empirical evaluation of their performance on large-scale digital pathology datasets is lacking. This work provides a benchmark for evaluating prevalent methods on multiple datasets by comparing the uncertainty estimates on both in-distribution and realistic near and far out-of-distribution (OOD) data on a whole-slide level. To this end, we aggregate uncertainty values from patch-based classifiers to whole-slide level uncertainty scores. We show that results found in classical computer vision benchmarks do not always translate to the medical imaging setting. Specifically, we demonstrate that deep ensembles perform best at detecting far-OOD data but can be outperformed on a more challenging near-OOD detection task by multi-head ensembles trained for optimal ensemble diversity. Furthermore, we demonstrate the harmful impact OOD data can have on the performance of deployed machine learning models. Overall, we show that uncertainty estimates can be used to discriminate in-distribution from OOD data with high AUC scores. Still, model deployment might require careful tuning based on prior knowledge of prospective OOD data.
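The core aggregation step described above — turning per-patch classifier uncertainties into a single whole-slide uncertainty score — can be sketched as follows. This is a minimal illustration, not the paper's exact method: the ensemble here averages member softmax outputs, uncertainty is measured as predictive entropy, and mean-pooling over patches is one plausible aggregation choice among several the paper benchmarks.

```python
import numpy as np

def patch_entropy(probs: np.ndarray) -> np.ndarray:
    """Predictive entropy per patch.
    probs: (n_patches, n_classes) array of softmax outputs."""
    eps = 1e-12  # guard against log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def ensemble_probs(member_probs: np.ndarray) -> np.ndarray:
    """Average the softmax outputs of the ensemble members.
    member_probs: (n_members, n_patches, n_classes)."""
    return member_probs.mean(axis=0)

def slide_uncertainty(member_probs, agg=np.mean) -> float:
    """Aggregate patch-level uncertainties to a whole-slide score.
    `agg` (mean here) is an illustrative pooling choice."""
    probs = ensemble_probs(np.asarray(member_probs))
    return float(agg(patch_entropy(probs)))

# A slide whose patches all get confident predictions should score
# lower than one where the ensemble is maximally uncertain.
confident = np.tile([0.99, 0.01], (3, 4, 1))  # 3 members, 4 patches, 2 classes
uncertain = np.full((3, 4, 2), 0.5)
assert slide_uncertainty(uncertain) > slide_uncertainty(confident)
```

Slides whose aggregated score exceeds a threshold (tuned on in-distribution data, as the abstract notes deployment may require) would then be flagged as potential OOD cases for further review.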