School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou, China.
School of Traditional Chinese Medicine, Jinan University, Guangzhou, China.
Med Phys. 2021 Dec;48(12):7850-7863. doi: 10.1002/mp.15312. Epub 2021 Nov 16.
In the domain of natural language processing, Transformers are recognized as state-of-the-art models, which, unlike typical convolutional neural networks (CNNs), do not rely on convolution layers. Instead, Transformers employ multi-head attention mechanisms as their main building block to capture long-range contextual relations between image pixels. CNNs have recently dominated deep learning solutions for diabetic retinopathy grade recognition. However, spurred by the advantages of Transformers, we propose a Transformer-based method suited to recognizing the grade of diabetic retinopathy.
The purposes of this work are to demonstrate that (i) the pure attention mechanism is suitable for diabetic retinopathy grade recognition and (ii) Transformers can replace traditional CNNs for diabetic retinopathy grade recognition.
This paper proposes a Vision Transformer-based method to recognize the grade of diabetic retinopathy. Each fundus image is subdivided into non-overlapping patches, which are flattened into a sequence of vectors and passed through a linear embedding, with positional embeddings added to preserve positional information. The resulting sequence is fed through several multi-head attention layers to generate the final representation. In the classification stage, the first token of that sequence is input to a softmax classification layer to produce the recognition output.
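The pipeline above (patchify, flatten, linearly embed with positions, attend, classify from the first token) can be sketched in a few lines of numpy. This is a minimal single-head illustration with randomly initialized weights and hypothetical sizes (32×32 crop, 8×8 patches, embedding dimension 16, 5 grades); it is not the paper's configuration or trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not the paper's configuration):
# a 32x32 single-channel fundus crop, 8x8 patches, embedding dimension 16.
img, patch, dim = rng.standard_normal((32, 32)), 8, 16
n = (32 // patch) ** 2                           # 16 non-overlapping patches

# 1) Subdivide into non-overlapping patches and flatten each to a vector.
patches = img.reshape(4, patch, 4, patch).transpose(0, 2, 1, 3).reshape(n, patch * patch)

# 2) Linear embedding, then prepend a class token and add positional embeddings.
W_embed = rng.standard_normal((patch * patch, dim)) * 0.02
cls = np.zeros((1, dim))
pos = rng.standard_normal((n + 1, dim)) * 0.02
x = np.vstack([cls, patches @ W_embed]) + pos    # shape (17, 16)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# 3) One single-head self-attention layer (the real model stacks multi-head layers).
Wq, Wk, Wv = (rng.standard_normal((dim, dim)) * 0.02 for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = softmax(q @ k.T / np.sqrt(dim))           # rows sum to 1
x = attn @ v

# 4) Classify from the first (class) token with a softmax layer over 5 DR grades.
W_head = rng.standard_normal((dim, 5)) * 0.02
probs = softmax(x[0] @ W_head)
print(probs.shape)                               # (5,) — one probability per grade
```

With trained weights, `probs` would be the model's predicted distribution over the five diabetic retinopathy grades; here it is near-uniform because all parameters are random.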
The training and testing datasets contain fundus images of different resolutions, subdivided into patches. We benchmark our method against current CNNs and extreme learning machines and achieve competitive performance. Specifically, the proposed deep learning architecture attains an accuracy of 91.4%, specificity = 0.977 (95% confidence interval (CI) (0.951-1)), precision = 0.928 (95% CI (0.852-1)), sensitivity = 0.926 (95% CI (0.863-0.989)), quadratic weighted kappa score = 0.935, and area under the curve (AUC) = 0.986.
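The quadratic weighted kappa reported above is the standard agreement statistic for ordinal grading tasks: it penalizes disagreements by the squared distance between predicted and true grades. The sketch below implements the textbook definition (not the paper's evaluation code) for grades 0-4:

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Quadratic weighted kappa for ordinal labels 0..n_classes-1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    # Observed confusion matrix O and expected matrix E (outer product of marginals).
    O = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / len(y_true)
    # Quadratic disagreement weights: 0 on the diagonal, largest for grades 0 vs 4.
    i, j = np.indices((n_classes, n_classes))
    W = (i - j) ** 2 / (n_classes - 1) ** 2
    return 1.0 - (W * O).sum() / (W * E).sum()

print(quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))  # → 1.0
```

Perfect agreement yields 1, chance-level agreement yields 0, and systematic disagreement can go negative; a score of 0.935 therefore indicates near-perfect ordinal agreement between predicted and reference grades.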
Our comparative experiments against current methods show that our model is competitive and indicate that an attention mechanism based on a Vision Transformer model is promising for the diabetic retinopathy grade recognition task.