Van Craenendonck Toon, Elen Bart, Gerrits Nele, De Boever Patrick
VITO NV, Unit Health, Mol, Belgium.
Transl Vis Sci Technol. 2020 Dec 29;9(2):64. doi: 10.1167/tvst.9.2.64. eCollection 2020 Dec.
Heatmapping techniques can support the explainability of deep learning (DL) predictions in medical image analysis. However, individual techniques have mainly been applied descriptively, without objective and systematic evaluation. We compared their performance using diabetic retinopathy lesion detection as a benchmark task.
The publicly available Indian Diabetic Retinopathy Image Dataset (IDRiD) contains fundus images of diabetes patients with pixel-level annotations of diabetic retinopathy (DR) lesions, which served as the ground truth for this study. Three pretrained DL models (ResNet50, VGG16, and InceptionV3) were used for DR detection in these images. Next, explainability was visualized with each of the 10 most commonly used heatmapping techniques. The quantitative correspondence between a heatmap's output and the ground truth was evaluated with the Explainability Consistency Score (ECS), a metric between 0 and 1 developed for this comparison.
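The abstract does not give the exact ECS formula. As an illustration only, a heatmap/ground-truth consistency score bounded in [0, 1] could be computed as the fraction of a heatmap's activation mass that falls inside the annotated lesion pixels; the function and names below are hypothetical stand-ins, not the paper's definition:

```python
import numpy as np

def consistency_score(heatmap: np.ndarray, lesion_mask: np.ndarray) -> float:
    """Illustrative consistency score in [0, 1]: fraction of positive
    heatmap activation mass lying inside expert-annotated lesion pixels.
    NOTE: a stand-in sketch, NOT the paper's ECS definition.
    """
    heatmap = np.clip(heatmap, 0.0, None)   # keep positive evidence only
    total = heatmap.sum()
    if total == 0.0:
        return 0.0                          # empty heatmap: no agreement
    return float(heatmap[lesion_mask.astype(bool)].sum() / total)
```

A score of 1 would mean all highlighted evidence overlaps annotated lesions; 0 would mean none does, which makes disagreements between models and heatmapping techniques directly comparable.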
For overall DR lesion detection, the ECS ranged from 0.21 to 0.51 across all model/heatmapping combinations. The highest score was obtained by VGG16+Grad-CAM (ECS = 0.51; 95% confidence interval [CI]: [0.46; 0.55]). For individual lesion types, VGG16+Grad-CAM performed best on hemorrhages and hard exudates, ResNet50+SmoothGrad performed best on soft exudates, and ResNet50+Guided Backpropagation performed best on microaneurysms.
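Grad-CAM, the best-performing technique here in combination with VGG16, weights each convolutional feature map by its spatially pooled gradient and keeps only positive contributions. A minimal, framework-free sketch of that combination step (the array shapes and function name are assumptions; a real pipeline would extract activations and gradients from the trained network):

```python
import numpy as np

def grad_cam_map(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Combine conv activations (C, H, W) with the gradients of the class
    score w.r.t. those activations (C, H, W) into a Grad-CAM heatmap (H, W).
    Sketch only, assuming both arrays come from the model's last conv layer.
    """
    weights = gradients.mean(axis=(1, 2))             # pool each channel's gradient
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum over channels
    cam = np.maximum(cam, 0.0)                        # ReLU: keep positive influence
    if cam.max() > 0:
        cam /= cam.max()                              # scale to [0, 1] for display
    return cam
```

The resulting low-resolution map is typically upsampled to the fundus image size before being overlaid and compared against pixel-level lesion annotations.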
Our empirical evaluation on the IDRiD database demonstrated that the choice of DL model and heatmapping technique affects explainability for common DR lesions. Our approach also revealed considerable disagreement between the regions highlighted by heatmaps and expert annotations.
These findings warrant a more systematic investigation and analysis of heatmaps before they can be relied on to explain image-based predictions of deep learning models.