Arun Nishanth, Gaw Nathan, Singh Praveer, Chang Ken, Aggarwal Mehak, Chen Bryan, Hoebel Katharina, Gupta Sharut, Patel Jay, Gidwani Mishka, Adebayo Julius, Li Matthew D, Kalpathy-Cramer Jayashree
Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital, Harvard Medical School, 149 13th St, Boston, MA 02129 (N.A., P.S., K.C., M.A., B.C., K.H., S.G., J.P., M.G., M.D.L., J.K.C.); Department of Computer Science, Shiv Nadar University, Greater Noida, India (N.A.); Department of Operational Sciences, Graduate School of Engineering and Management, Air Force Institute of Technology, Wright-Patterson AFB, Dayton, Ohio (N.G.); and Massachusetts Institute of Technology, Cambridge, Mass (K.C., B.C., K.H., J.P., J.A.).
Radiol Artif Intell. 2021 Oct 6;3(6):e200267. doi: 10.1148/ryai.2021200267. eCollection 2021 Nov.
To evaluate the trustworthiness of saliency maps for abnormality localization in medical imaging.
Using two large publicly available radiology datasets (Society for Imaging Informatics in Medicine-American College of Radiology Pneumothorax Segmentation dataset and Radiological Society of North America Pneumonia Detection Challenge dataset), the performance of eight commonly used saliency map techniques was quantified with regard to localization utility (segmentation and detection), sensitivity to model weight randomization, repeatability, and reproducibility. Their performance was compared with that of baseline methods and localization network architectures, using area under the precision-recall curve (AUPRC) and structural similarity index measure (SSIM) as metrics.
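As an illustration of how localization utility can be scored with AUPRC, the following is a minimal sketch, not the authors' implementation. It treats every pixel of a saliency map as a scored prediction and the ground-truth mask as the label, then computes average precision (a step-wise estimate of AUPRC). The function name `pixel_auprc` and the toy 4 x 4 arrays are assumptions made for this example; the abstract does not specify the exact computation used in the study.

```python
import numpy as np

def pixel_auprc(saliency, mask):
    """Average precision of saliency scores against a binary mask.

    Hypothetical helper for illustration only. Ties between equal
    scores are broken by pixel order, which a production metric
    would handle more carefully.
    """
    scores = np.asarray(saliency, dtype=float).ravel()
    labels = np.asarray(mask).ravel().astype(bool)
    n_pos = labels.sum()
    if n_pos == 0:
        raise ValueError("mask contains no positive pixels")
    # Rank pixels from most to least salient.
    order = np.argsort(-scores, kind="stable")
    hits = labels[order]
    # Precision at each rank, then averaged over the positive pixels.
    precision = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float(precision[hits].sum() / n_pos)

# Toy example: the abnormality occupies the top-left 2 x 2 corner.
mask = np.zeros((4, 4), dtype=int)
mask[:2, :2] = 1
good_map = mask.astype(float)   # saliency concentrated on the abnormality
bad_map = 1.0 - good_map        # saliency concentrated everywhere else

print(pixel_auprc(good_map, mask))
print(pixel_auprc(bad_map, mask))
```

A map that ranks all abnormal pixels above all normal pixels scores 1, while a map that highlights only the background scores far lower, which is the sense in which the low AUPRC values reported below indicate poor localization.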
All eight saliency map techniques failed at least one of the criteria and were inferior in performance compared with localization networks. For pneumothorax segmentation, the AUPRC ranged from 0.024 to 0.224, while a U-Net achieved a significantly superior AUPRC of 0.404 (P < .005). For pneumonia detection, the AUPRC ranged from 0.160 to 0.519, while a RetinaNet achieved a significantly superior AUPRC of 0.596 (P < .005). Five and two saliency methods (of eight) failed the model randomization test on the segmentation and detection datasets, respectively, suggesting that these methods are not sensitive to changes in model parameters. The repeatability and reproducibility of the majority of the saliency methods were worse than localization networks for both the segmentation and detection datasets.
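The idea behind the model randomization test can be sketched as follows: compare the saliency map from the trained model with the map produced after the model's weights are randomized. If the two maps remain highly similar, the method is not reflecting what the model learned. The sketch below uses a simplified single-window (global) SSIM rather than the sliding-window SSIM commonly used in practice. The function name `global_ssim`, the constants, and the random arrays standing in for saliency maps are all assumptions for illustration, not the study's code or data.

```python
import numpy as np

def global_ssim(a, b, data_range=1.0):
    """Single-window SSIM over whole images (simplified, illustrative)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("maps must have the same shape")
    # Conventional stabilizing constants for SSIM.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a = ((a - mu_a) ** 2).mean()
    var_b = ((b - mu_b) ** 2).mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(num / den)

# Stand-ins for saliency maps (random data, not real model output).
rng = np.random.default_rng(0)
map_trained = rng.random((32, 32))
map_unchanged = map_trained.copy()   # method ignores the randomized weights
map_changed = rng.random((32, 32))   # method responds to the randomized weights

print(global_ssim(map_trained, map_unchanged))
print(global_ssim(map_trained, map_changed))
```

Under this reading, a saliency method whose maps behave like `map_unchanged` would fail the test (similarity stays near 1 despite randomized weights), while one whose maps behave like `map_changed` would pass.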
The use of saliency maps in the high-risk domain of medical imaging warrants additional scrutiny, and we recommend that detection or segmentation models be used if localization is the desired output of the network. Keywords: Technology Assessment, Technical Aspects, Feature Detection, Convolutional Neural Network (CNN). Supplemental material is available for this article. © RSNA, 2021.