Gottlich Harrison C, Korfiatis Panagiotis, Gregory Adriana V, Kline Timothy L
Mayo Clinic Alix School of Medicine, Mayo Clinic, Rochester, MN, United States.
Department of Radiology, Mayo Clinic, Rochester, MN, United States.
Front Radiol. 2023 Sep 15;3:1223294. doi: 10.3389/fradi.2023.1223294. eCollection 2023.
Methods that automatically flag poorly performing predictions are urgently needed, both to safely implement machine learning workflows in clinical practice and to identify difficult cases during model training.
Disagreement between the five cross-validation sub-models was quantified using Dice scores between folds and summarized as a surrogate for model confidence. The summarized interfold Dice scores were compared with thresholds informed by human interobserver values to determine whether the final ensemble model's prediction should be manually reviewed.
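The interfold comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`dice`, `interfold_dice`, `needs_review`), the use of all pairwise fold combinations, the handling of two empty masks, and the rule that a summary below the threshold triggers review are all assumptions. The threshold is left as a parameter because the abstract does not report its value.

```python
from itertools import combinations

import numpy as np


def dice(a, b):
    """Dice overlap between two binary masks of the same shape."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return float(2.0 * np.logical_and(a, b).sum() / total)


def interfold_dice(fold_masks):
    """Dice score for every pair of sub-model predictions on one image."""
    return [dice(a, b) for a, b in combinations(fold_masks, 2)]


def needs_review(fold_masks, threshold, summary=np.median):
    """Summarize interfold Dice and compare it with a threshold.

    `threshold` is hypothetical here; the source says only that it was
    informed by human interobserver values. Returns (flag, summary_value),
    where flag is True when agreement falls below the threshold.
    """
    value = float(summary(interfold_dice(fold_masks)))
    return value < threshold, value
```

With five sub-model masks for one image, `needs_review(masks, threshold=t)` returns whether the image would be sent for manual review and the summarized interfold Dice that led to that decision. Passing `summary=np.mean` or `summary=np.min` swaps in a different summary statistic.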
The method efficiently flagged poorly segmented images on all tasks without consulting a reference standard. Using the median interfold Dice for comparison, substantial Dice score improvements after excluding flagged images were noted for the in-domain CT (0.85 ± 0.20 to 0.91 ± 0.08, 8/50 images flagged) and MR (0.76 ± 0.27 to 0.85 ± 0.09, 8/50 images flagged) tasks. The largest Dice score improvement occurred in the simulated out-of-distribution task, in which a model trained on a radical nephrectomy dataset with different contrast phases predicted on a partial nephrectomy dataset acquired entirely in the cortico-medullary phase (0.67 ± 0.36 to 0.89 ± 0.10, 122/300 images flagged).
Comparing interfold sub-model disagreement against human interobserver values is an effective and efficient way to assess automated predictions when a reference standard is not available. This functionality provides a safeguard for patient care that is necessary to safely implement automated medical image segmentation workflows.