Comparison of Prostate MRI Lesion Segmentation Agreement Between Multiple Radiologists and a Fully Automatic Deep Learning System.

Affiliations

Division of Radiology, German Cancer Research Center (DKFZ), Heidelberg, Germany.

Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany.

Publication information

Rofo. 2021 May;193(5):559-573. doi: 10.1055/a-1290-8070. Epub 2020 Nov 19.

DOI:10.1055/a-1290-8070

PMID:33212541

Abstract

PURPOSE

A recently developed deep learning model (U-Net) approximated the clinical performance of radiologists in the prediction of clinically significant prostate cancer (sPC) from prostate MRI. Here, we compare the agreement between lesion segmentations by U-Net with manual lesion segmentations performed by different radiologists.

MATERIALS AND METHODS

165 patients with suspicion of sPC underwent targeted and systematic fusion biopsy following 3 Tesla multiparametric MRI (mpMRI). Five sets of segmentations were generated retrospectively: segmentations of clinical lesions, independent segmentations by three radiologists, and fully automated bi-parametric U-Net segmentations. Per-lesion agreement was calculated for each rater by averaging the Dice coefficients with all overlapping lesions from other raters. Agreement was compared using descriptive statistics and linear mixed models.
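The per-lesion agreement described above can be illustrated with a minimal sketch. It assumes each lesion is available as a binary mask on a common grid; the function names `dice` and `per_lesion_agreement` and the toy masks are hypothetical and are not taken from the study's code.

```python
import numpy as np


def dice(a, b):
    """Dice coefficient 2|A∩B| / (|A| + |B|) of two binary masks."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return float("nan")  # undefined when both masks are empty
    return 2.0 * np.logical_and(a, b).sum() / denom


def per_lesion_agreement(lesion, other_raters_lesions):
    """Mean Dice of one lesion with all overlapping lesions of other raters.

    Lesions that do not overlap at all are ignored; returns None when no
    other rater segmented an overlapping lesion.
    """
    scores = [
        dice(lesion, other)
        for other in other_raters_lesions
        if np.logical_and(lesion, other).any()
    ]
    return float(np.mean(scores)) if scores else None


def square(r0, r1, c0, c1, shape=(10, 10)):
    """Toy rectangular 'lesion' mask (illustrative data only)."""
    mask = np.zeros(shape, dtype=bool)
    mask[r0:r1, c0:c1] = True
    return mask


lesion_a = square(2, 6, 2, 6)    # rater A, 16 pixels
lesion_b = square(2, 6, 4, 8)    # rater B, half overlapping with A
lesion_c = square(3, 5, 3, 5)    # rater C, small mask inside A
lesion_d = square(8, 10, 8, 10)  # rater D, no overlap with A

agreement = per_lesion_agreement(lesion_a, [lesion_b, lesion_c, lesion_d])
print(agreement)
```

In this toy case the two overlapping masks give Dice values of 0.5 and 0.4, so the per-lesion agreement for rater A's lesion is their mean; the non-overlapping mask does not enter the average.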

RESULTS

The mean Dice coefficient for manual segmentations showed only moderate agreement at 0.48-0.52, reflecting the difficult visual task of determining the outline of otherwise jointly detected lesions. U-Net segmentations were significantly smaller than manual segmentations (p < 0.0001) and exhibited a mean Dice coefficient of 0.22, which was significantly lower than that of manual segmentations (all p < 0.0001). These differences remained after correction for lesion size and did not differ between sPC and non-sPC lesions or between peripheral and transition zone lesions.

CONCLUSION

Knowledge of the order of agreement of manual segmentations of different radiologists is important to set the expectation value for artificial intelligence (AI) systems in the task of prostate MRI lesion segmentation. Perfect agreement (Dice coefficient of one) should not be expected for AI. Lower Dice coefficients of U-Net compared to manual segmentations are only partially explained by smaller segmentation sizes and may result from a focus on the lesion core and a small relative lesion center shift. Although it is primarily important that AI detects sPC correctly, the Dice coefficient for overlapping lesions from multiple raters can be used as a secondary measure for segmentation quality in future studies.
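As a purely geometric illustration of how a smaller segmentation and a center shift each lower the Dice coefficient, the following sketch uses idealized 2D disks. The radii, shift, and grid size are arbitrary assumptions chosen for the example and do not reproduce the study's data or analysis.

```python
import numpy as np


def disk(center, radius, shape=(201, 201)):
    """Binary mask of a filled disk on a 2D pixel grid."""
    rows, cols = np.ogrid[: shape[0], : shape[1]]
    return (rows - center[0]) ** 2 + (cols - center[1]) ** 2 <= radius ** 2


def dice(a, b):
    """Dice coefficient 2|A∩B| / (|A| + |B|) of two binary masks."""
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())


# Hypothetical "manual" outline and a smaller "core" segmentation.
manual = disk((100, 100), 40)
core = disk((100, 100), 20)          # fully nested inside the manual mask
shifted_core = disk((100, 130), 20)  # same size, center shifted by 30 pixels

# For a nested mask, Dice depends only on the size ratio r = |core| / |manual|:
# Dice = 2r / (1 + r).
r = core.sum() / manual.sum()
dice_nested = dice(manual, core)
dice_shifted = dice(manual, shifted_core)

print(f"size ratio r      = {r:.3f}")
print(f"Dice, nested core  = {dice_nested:.3f} (2r/(1+r) = {2 * r / (1 + r):.3f})")
print(f"Dice, shifted core = {dice_shifted:.3f}")
```

Even a perfectly centered core segmentation cannot exceed 2r/(1+r), and shifting it so that part of it leaves the manual outline lowers the value further, which is consistent with size explaining only part of the observed difference.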

KEY POINTS

· Intermediate human Dice coefficients reflect the difficulty of outlining jointly detected lesions.
· Lower Dice coefficients of deep learning motivate further research to approximate human perception.
· Comparable predictive performance of deep learning appears independent of Dice agreement.
· Dice agreement independent of significant cancer presence indicates indistinguishability of some benign imaging findings.
· Improving DWI to T2 registration may improve the observed U-Net Dice coefficients.

CITATION FORMAT

· Schelb P, Tavakoli AA, Tubtawee T et al. Comparison of Prostate MRI Lesion Segmentation Agreement Between Multiple Radiologists and a Fully Automatic Deep Learning System. Fortschr Röntgenstr 2021; 193: 559 - 573.
