LIV4D, Polytechnique Montréal, 2500 Ch. de Polytechnique, Montréal, QC, H3T 1J4, Canada.
Centre Universitaire d'Ophtalmologie, Maisonneuve-Rosemont Hospital, 5415 Boul. de l'Assomption, Montréal, QC, H1T 2M4, Canada.
Med Image Anal. 2022 Nov;82:102608. doi: 10.1016/j.media.2022.102608. Epub 2022 Sep 7.
Vision Transformers have recently emerged as a competitive architecture in image classification. The tremendous popularity of this model and its variants comes from its high performance and its ability to produce interpretable predictions. However, both of these characteristics remain to be assessed in depth on retinal images. This study proposes a thorough performance evaluation of several Transformers compared to traditional Convolutional Neural Network (CNN) models for retinal disease classification. Special attention is given to multi-modality imaging (fundus and OCT) and generalization to external data. In addition, we propose a novel mechanism to generate interpretable predictions via attribution maps. Existing attribution methods for Transformer models have the disadvantage of producing low-resolution heatmaps. Our contribution, called Focused Attention, uses iterative conditional patch resampling to tackle this issue. By means of a survey involving four retinal specialists, we validated both the superior interpretability of Vision Transformers compared to the attribution maps produced by CNNs and the relevance of Focused Attention as a lesion detector.
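To illustrate why standard Transformer attributions are low-resolution, the sketch below implements attention rollout, a common baseline for deriving attribution maps from a Vision Transformer's self-attention. Note this is a generic illustration on synthetic attention matrices, not the paper's Focused Attention method: the attributions live on the patch grid (here a hypothetical 4x4 grid of 16 patches), which is why the resulting heatmaps are coarse.

```python
import numpy as np

def attention_rollout(attentions):
    """Combine per-layer self-attention maps into one attribution matrix.

    Each layer's attention is averaged over heads, mixed with the identity
    to account for residual connections, row-normalized, and multiplied
    through the layers (attention rollout).
    """
    n_tokens = attentions[0].shape[-1]
    rollout = np.eye(n_tokens)
    for attn in attentions:
        a = attn.mean(axis=0)                       # average over heads -> (n, n)
        a = a + np.eye(n_tokens)                    # residual connection
        a = a / a.sum(axis=-1, keepdims=True)       # keep rows stochastic
        rollout = a @ rollout
    return rollout

# Toy setup (all sizes are illustrative): 4 layers, 3 heads,
# 1 CLS token + 16 patch tokens (a 4x4 patch grid).
rng = np.random.default_rng(0)
attns = [rng.random((3, 17, 17)) for _ in range(4)]
attns = [a / a.sum(axis=-1, keepdims=True) for a in attns]  # row-stochastic

rollout = attention_rollout(attns)
cls_to_patches = rollout[0, 1:]            # CLS-token attribution over patches
heatmap = cls_to_patches.reshape(4, 4)     # coarse, patch-resolution heatmap
```

The final `heatmap` has one value per image patch, so its resolution is fixed by the patch grid (e.g. 14x14 for a 224-pixel image with 16-pixel patches), which is the limitation the abstract says Focused Attention addresses via iterative conditional patch resampling.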