From the Department of Biomedical Engineering, Center for Biotechnology and Interdisciplinary Studies, Rensselaer Polytechnic Institute, 110 8th St, Biotech 4231, Troy, NY 12180 (J.Z., H.C., G.W., P.Y.); and Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, Mass (G.D., M.K.K.).
Radiol Artif Intell. 2024 Jan;6(1):e220221. doi: 10.1148/ryai.220221.
Purpose To determine whether saliency maps in radiology artificial intelligence (AI) are vulnerable to subtle input perturbations that could lead to misleading interpretations, using the prediction-saliency correlation (PSC) to evaluate the sensitivity and robustness of saliency methods.

Materials and Methods In this retrospective study, locally trained deep learning models and a research prototype provided by a commercial vendor were systematically evaluated on 191 229 chest radiographs from the CheXpert dataset and 7022 MR images from a human brain tumor classification dataset. Two radiologists performed a reader study on 270 chest radiograph pairs. A model-agnostic approach for computing the PSC coefficient was used to evaluate the sensitivity and robustness of seven commonly used saliency methods.

Results With the locally trained model parameters, the saliency methods showed low sensitivity (maximum PSC, 0.25; 95% CI: 0.12, 0.38) and weak robustness (maximum PSC, 0.12; 95% CI: 0.0, 0.25) on the CheXpert dataset. Further evaluation showed that, without knowledge of the model specifics, the saliency maps generated from a commercial prototype could be rendered irrelevant to the model output (the area under the receiver operating characteristic curve decreased by 8.6% without affecting the saliency map). The human observer studies confirmed that experts have difficulty identifying the perturbed images, correctly identifying fewer than 44.8% of them.

Conclusion Popular saliency methods scored low PSC values on the two datasets of perturbed images, indicating weak sensitivity and weak robustness. The proposed PSC metric provides a valuable quantitative tool for validating the trustworthiness of medical AI explainability.

Keywords: Saliency Maps, AI Trustworthiness, Dynamic Consistency, Sensitivity, Robustness

© RSNA, 2023

See also the commentary by Yanagawa and Sato in this issue.
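Although the abstract does not specify how the PSC coefficient is computed, its basic idea can be illustrated with a short sketch: perturb an input many times and correlate how much the model's prediction changes with how much the saliency map changes. The Python sketch below assumes a PyTorch image classifier, a plain input-gradient saliency method, small random noise as the subtle perturbation, cosine distance as the saliency-change measure, and a Pearson correlation as the PSC coefficient. All of these choices, and the function names, are illustrative assumptions, not the authors' implementation; the study itself evaluates seven saliency methods with a model-agnostic perturbation scheme.

# Minimal sketch of a prediction-saliency correlation (PSC) computation.
# Assumptions (not taken from the paper): PyTorch classifier, input-gradient
# saliency, random-noise perturbations, cosine distance for saliency change,
# Pearson r as the PSC coefficient.

import torch
import torch.nn.functional as F
from scipy.stats import pearsonr


def gradient_saliency(model, x, target_class):
    """Plain input-gradient saliency map (one of many possible methods)."""
    x = x.clone().requires_grad_(True)
    score = model(x)[0, target_class]
    score.backward()
    return x.grad.detach().abs().sum(dim=1)  # collapse channels -> 1 x H x W


def psc(model, x, target_class, n_perturbations=50, eps=0.01):
    """Pearson correlation between prediction changes and saliency-map
    changes over a set of subtly perturbed copies of the input x."""
    base_prob = torch.softmax(model(x), dim=1)[0, target_class].item()
    base_sal = gradient_saliency(model, x, target_class).flatten()

    d_pred, d_sal = [], []
    for _ in range(n_perturbations):
        x_p = x + eps * torch.randn_like(x)        # subtle input perturbation
        prob = torch.softmax(model(x_p), dim=1)[0, target_class].item()
        sal = gradient_saliency(model, x_p, target_class).flatten()
        d_pred.append(abs(prob - base_prob))       # change in model output
        # 1 - cosine similarity as the saliency-change measure (an assumption)
        d_sal.append(1 - F.cosine_similarity(base_sal, sal, dim=0).item())

    r, _ = pearsonr(d_pred, d_sal)                 # PSC coefficient
    return r

Under this reading, a sensitive and robust saliency method should yield a high PSC: large changes in the prediction should coincide with large changes in the saliency map, and near-unchanged predictions with near-unchanged maps. The low maximum PSC values reported above (0.25 and 0.12) indicate that the evaluated saliency maps barely tracked the model output.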