Sahlsten Jaakko, Jaskari Joel, Wahid Kareem A, Ahmed Sara, Glerean Enrico, He Renjie, Kann Benjamin H, Mäkitie Antti, Fuller Clifton D, Naser Mohamed A, Kaski Kimmo
Department of Computer Science, Aalto University School of Science, Espoo, Finland.
Department of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX USA.
medRxiv. 2023 Feb 24:2023.02.20.23286188. doi: 10.1101/2023.02.20.23286188.
Oropharyngeal cancer (OPC) is a widespread disease, with radiotherapy being a core treatment modality. Manual segmentation of the primary gross tumor volume (GTVp) is currently employed for OPC radiotherapy planning, but is subject to significant interobserver variability. Deep learning (DL) approaches have shown promise in automating GTVp segmentation, but comparative (auto)confidence metrics of these models' predictions have not been well explored. Quantifying instance-specific DL model uncertainty is crucial to improving clinician trust and facilitating broad clinical implementation. Therefore, in this study, probabilistic DL models for GTVp auto-segmentation were developed using large-scale PET/CT datasets, and various uncertainty auto-estimation methods were systematically investigated and benchmarked.
We utilized the publicly available 2021 HECKTOR Challenge training dataset, with 224 co-registered PET/CT scans of OPC patients and corresponding GTVp segmentations, as a development set. A separate set of 67 co-registered PET/CT scans of OPC patients with corresponding GTVp segmentations was used for external validation. Two approximate Bayesian deep learning methods, the MC Dropout Ensemble and the Deep Ensemble, both with five submodels, were evaluated for GTVp segmentation and uncertainty performance. Segmentation performance was evaluated using the volumetric Dice similarity coefficient (DSC), mean surface distance (MSD), and Hausdorff distance at 95% (95HD). Uncertainty was evaluated using four measures from the literature (coefficient of variation (CV), structure expected entropy, structure predictive entropy, and structure mutual information) and additionally with our novel measure. The utility of the uncertainty information was evaluated by the accuracy of uncertainty-based segmentation performance prediction, using the Accuracy vs Uncertainty (AvU) metric, and by examining the linear correlation between uncertainty estimates and DSC. In addition, batch-based and instance-based referral processes were examined, in which patients with high uncertainty were rejected from the set. In the batch referral process, the area under the referral curve with DSC (R-DSC AUC) was used for evaluation, whereas in the instance referral process, the DSC at various uncertainty thresholds was examined.
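The measures named above can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not state: the "structure" entropies are taken as voxelwise binary entropies averaged over the whole volume, mutual information is taken as predictive entropy minus expected entropy, and CV is taken as the coefficient of variation of the submodels' predicted volumes. The function names, the 0.5 binarization threshold, and the aggregation over all voxels are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def binary_entropy(p, eps=1e-12):
    """Voxelwise entropy (in nats) of a foreground probability map."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))


def dice(a, b):
    """Volumetric Dice similarity coefficient between two binary masks."""
    a = np.asarray(a).astype(bool)
    b = np.asarray(b).astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumption)
    return 2.0 * np.logical_and(a, b).sum() / denom


def ensemble_uncertainty(probs, threshold=0.5):
    """Scalar uncertainty summaries for one patient.

    probs: array of shape (M, ...) holding the foreground probability
    maps of M submodels (M = 5 in the study).
    """
    probs = np.asarray(probs, dtype=float)
    mean_p = probs.mean(axis=0)

    # Entropy of the averaged prediction (total uncertainty).
    predictive = binary_entropy(mean_p)
    # Average of the per-submodel entropies.
    expected = binary_entropy(probs).mean(axis=0)
    # Disagreement between submodels.
    mutual = predictive - expected

    # Coefficient of variation of the predicted volumes (assumed definition).
    volumes = (probs > threshold).reshape(probs.shape[0], -1).sum(axis=1)
    cv = float(volumes.std() / volumes.mean()) if volumes.mean() > 0 else 0.0

    return {
        "predictive_entropy": float(predictive.mean()),
        "expected_entropy": float(expected.mean()),
        "mutual_information": float(mutual.mean()),
        "cv": cv,
    }
```

When all submodels agree, the mutual information in this sketch is zero regardless of how confident they are; it grows only when the submodels disagree, which is why it is often read as a proxy for model (epistemic) uncertainty.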
Both models behaved similarly in terms of segmentation performance and uncertainty estimation. Specifically, the MC Dropout Ensemble had 0.776 DSC, 1.703 mm MSD, and 5.385 mm 95HD. The Deep Ensemble had 0.767 DSC, 1.717 mm MSD, and 5.477 mm 95HD. The uncertainty measure with the highest DSC correlation was structure predictive entropy, with correlation coefficients of 0.699 and 0.692 for the MC Dropout Ensemble and the Deep Ensemble, respectively. The highest AvU value was 0.866 for both models. The best performing uncertainty measure for both models was the CV, which had an R-DSC AUC of 0.783 and 0.782 for the MC Dropout Ensemble and the Deep Ensemble, respectively. When patients were referred based on uncertainty thresholds derived from a validation DSC of 0.85, the DSC improved relative to the full dataset by 4.7% and 5.0% on average across all uncertainty measures, while 21.8% and 22% of patients were referred, for the MC Dropout Ensemble and the Deep Ensemble, respectively.
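The batch and instance referral evaluations reported above can be sketched as follows. This is an illustrative reading only: the abstract does not specify how the referral curve is sampled or normalized, so the sketch assumes patients are referred one at a time in order of decreasing uncertainty, that the last remaining patient is never referred, and that the area is divided by the referred-fraction range so it stays on the DSC scale. The function names are hypothetical.

```python
import numpy as np


def referral_curve(dsc, uncertainty):
    """Mean DSC of retained patients as the most uncertain are referred."""
    dsc = np.asarray(dsc, dtype=float)
    uncertainty = np.asarray(uncertainty, dtype=float)
    order = np.argsort(uncertainty)  # most certain patients first
    sorted_dsc = dsc[order]
    n = len(sorted_dsc)
    # Referring k patients leaves the n - k most certain ones.
    fractions = np.arange(n) / n
    mean_dsc = np.array([sorted_dsc[: n - k].mean() for k in range(n)])
    return fractions, mean_dsc


def referral_auc(fractions, mean_dsc):
    """Trapezoidal area under the referral curve, normalized by its width."""
    width = fractions[-1] - fractions[0]
    if width == 0:
        return float(mean_dsc[0])  # single patient: no curve to integrate
    area = np.sum(np.diff(fractions) * (mean_dsc[1:] + mean_dsc[:-1]) / 2.0)
    return float(area / width)


def refer_by_threshold(dsc, uncertainty, threshold):
    """Instance referral: keep patients whose uncertainty is <= threshold."""
    dsc = np.asarray(dsc, dtype=float)
    keep = np.asarray(uncertainty, dtype=float) <= threshold
    referred_fraction = 1.0 - keep.mean()
    retained_dsc = float(dsc[keep].mean()) if keep.any() else float("nan")
    return retained_dsc, float(referred_fraction)
```

With toy values `dsc = [0.9, 0.8, 0.5, 0.2]` and `uncertainty = [0.1, 0.2, 0.6, 0.9]`, the retained mean DSC rises as the most uncertain patients are referred, which is the behavior a useful uncertainty measure should show.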
We found that many of the investigated methods provide overall similar but distinct utility in terms of predicting segmentation quality and referral performance. These findings are a critical first step towards more widespread implementation of uncertainty quantification in OPC GTVp segmentation.