Schott Brayden, Santoro-Fernandes Victor, Klaneček Žan, Perlman Scott, Jeraj Robert
Department of Medical Physics, School of Medicine and Public Health, University of Wisconsin, Madison, WI, United States of America.
Faculty of Mathematics and Physics, University of Ljubljana, Ljubljana, Slovenia.
Phys Med Biol. 2025 May 23;70(11). doi: 10.1088/1361-6560/add9df.
Deep learning models are increasingly being implemented for automated medical image analysis to inform patient care. Most models, however, lack uncertainty information, without which the reliability of model outputs cannot be ensured. Several uncertainty quantification (UQ) methods exist to capture model uncertainty, yet it is not clear which method is optimal for a given task. The purpose of this work was to investigate several commonly used UQ methods for the critical yet understudied task of metastatic lesion segmentation on whole-body PET/CT.
Fifty-nine whole-body Ga-DOTATATE PET/CT images of patients undergoing theranostic treatment of metastatic neuroendocrine tumors were used in this work. A 3D U-Net was trained for lesion segmentation with five-fold cross-validation. Uncertainty measures derived from four UQ methods (probability entropy, Monte Carlo dropout, deep ensembles, and test time augmentation) were investigated. Each uncertainty measure was assessed across four quantitative evaluations of its ability to: (1) detect artificially degraded image data at low, medium, and high degradation magnitudes; (2) detect false-positive (FP) predicted regions; (3) recover false-negative (FN) predicted regions; and (4) correlate with model biomarker extraction and segmentation performance metrics.
Test time augmentation achieved the highest and probability entropy the lowest degraded-image detection performance at low (AUC = 0.68 vs. 0.54), medium (AUC = 0.82 vs. 0.70), and high (AUC = 0.90 vs. 0.83) degradation magnitudes. For detecting FPs, all UQ methods performed similarly well, with AUC values in the narrow range of 0.77 to 0.81. FN region recovery performance was strongest for test time augmentation and weakest for probability entropy. Results of the correlation analysis were mixed: the strongest correlations were achieved by test time augmentation for SUVcapture (ρ = 0.57) and segmentation Dice coefficient (ρ = 0.72), by Monte Carlo dropout for SUVcapture (ρ = 0.35), and by probability entropy for segmentation cross entropy (ρ = 0.96).
Overall, test time augmentation demonstrated superior UQ performance and is recommended for metastatic lesion segmentation tasks. It also offers the advantage of being post hoc and computationally efficient. In contrast, probability entropy performed the worst, highlighting the need for more advanced UQ approaches for this task.
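For readers unfamiliar with the UQ methods compared above, the following is a minimal sketch of how voxel-wise uncertainty maps of this kind are commonly computed. It is not the authors' implementation. The function names, the use of axis flips as the only augmentation, and the choice of standard deviation as the sample-based uncertainty measure are all assumptions made for illustration; the abstract does not specify these details. Monte Carlo dropout and deep ensembles would reuse `sample_uncertainty` with predictions drawn from stochastic forward passes or from separately trained models.

```python
import numpy as np


def probability_entropy(p, eps=1e-12):
    """Voxel-wise binary entropy (in nats) of a foreground-probability map."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))


def sample_uncertainty(prob_maps):
    """Mean prediction and voxel-wise standard deviation across K predictions.

    prob_maps: sequence of K probability maps with identical shape.
    The standard deviation is an assumed uncertainty measure.
    """
    stacked = np.asarray(prob_maps, dtype=float)
    return stacked.mean(axis=0), stacked.std(axis=0)


def flip_tta(predict, image, axes=(0, 1, 2)):
    """Test time augmentation using axis flips only (an assumption).

    predict: hypothetical callable mapping an image to a probability map
    of the same shape. Each flipped prediction is flipped back so that
    all maps are aligned before they are aggregated.
    """
    preds = [predict(image)]
    for ax in axes:
        flipped_pred = predict(np.flip(image, axis=ax))
        preds.append(np.flip(flipped_pred, axis=ax))
    return sample_uncertainty(preds)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.normal(size=(4, 5, 6))

    # Toy stand-in for a trained network; not flip-equivariant along axis 0.
    def toy_model(x):
        return 1.0 / (1.0 + np.exp(-(x + np.roll(x, 1, axis=0))))

    mean_prob, tta_std = flip_tta(toy_model, image)
    entropy_map = probability_entropy(mean_prob)
    print(entropy_map.shape, tta_std.shape)
```

Probability entropy needs only a single forward pass, whereas the sample-based measures need several predictions per image, which is the trade-off behind the post hoc and efficiency remarks above.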