Goh Jocelyn Hui Lin, Ang Elroy, Srinivasan Sahana, Lei Xiaofeng, Loh Johnathan, Quek Ten Cheer, Xue Cancan, Xu Xinxing, Liu Yong, Cheng Ching-Yu, Rajapakse Jagath C, Tham Yih-Chung
Singapore Eye Research Institute, Singapore National Eye Center, Singapore, Singapore.
School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore.
Ophthalmol Sci. 2024 May 17;4(6):100552. doi: 10.1016/j.xops.2024.100552. eCollection 2024 Nov-Dec.
Vision transformers (ViTs) have shown promising performance in various classification tasks previously dominated by convolutional neural networks (CNNs). However, the performance of ViTs in referable diabetic retinopathy (DR) detection remains relatively underexplored. In this study, we evaluated the comparative performance of ViTs and CNNs in detecting referable DR from retinal photographs.
Retrospective study.
A total of 48 269 retinal images from the open-source Kaggle DR detection dataset, the Messidor-1 dataset and the Singapore Epidemiology of Eye Diseases (SEED) study were included.
Using 41 614 retinal photographs from the Kaggle dataset, we developed 5 CNN models (Visual Geometry Group 19, ResNet50, InceptionV3, DenseNet201, and EfficientNetV2S) and 4 ViT models (VAN_small, CrossViT_small, ViT_small, and Hierarchical Vision Transformer using Shifted Windows [SWIN]_tiny) for the detection of referable DR. Referable DR was defined as eyes with moderate or worse DR. The comparative performance of all 9 models was evaluated on the Kaggle internal test set (1045 study eyes) and on 2 external test sets: the SEED study (5455 study eyes) and the Messidor-1 dataset (1200 study eyes).
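As a concrete illustration of this setup, the sketch below shows how a SWIN_tiny backbone could be configured for binary referable-DR classification with the Kaggle-style 0-4 severity grades binarized at moderate or worse. This is a minimal sketch, not the authors' code; the use of the timm library, the 224x224 input size, and the single-logit head are assumptions for illustration.

```python
# Minimal sketch (assumed, not the authors' implementation) of a SWIN_tiny model
# configured for referable-DR detection from fundus photographs.
import timm
import torch
import torch.nn as nn

def binarize_dr_grade(grade: int) -> int:
    """Map a Kaggle-style 0-4 DR severity grade to a referable/non-referable label."""
    # moderate (2), severe (3), and proliferative (4) DR count as referable
    return int(grade >= 2)

# Single-logit head for binary classification; pretrained weights and input size are assumptions.
model = timm.create_model("swin_tiny_patch4_window7_224", pretrained=True, num_classes=1)
criterion = nn.BCEWithLogitsLoss()

# Forward pass on a dummy batch of 224x224 RGB fundus images with example grades.
images = torch.randn(4, 3, 224, 224)
labels = torch.tensor([binarize_dr_grade(g) for g in [0, 2, 3, 1]], dtype=torch.float32)
loss = criterion(model(images).squeeze(1), labels)
```

The CNN comparators (e.g., ResNet50 or EfficientNetV2S) could be instantiated the same way by swapping the model name, which keeps the training and evaluation pipeline identical across architectures.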
Area under the receiver operating characteristic curve (AUC), specificity, and sensitivity.
Among all models, the SWIN transformer achieved the highest AUC of 95.7% on the internal test set, significantly outperforming the CNN models (all P < 0.001). The same observation was confirmed in the external test sets, with the SWIN transformer achieving AUCs of 97.3% in SEED and 96.3% in Messidor-1. With specificity fixed at 80% on the internal test set, the SWIN transformer achieved the highest sensitivity of 94.4%, significantly better than all CNN models (sensitivities ranging from 76.3% to 83.8%; all P < 0.001). This trend was consistently observed in both external test sets.
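The reported comparisons rest on two quantities: the AUC and the sensitivity obtained when the operating threshold is fixed to a target specificity (here 80%). The sketch below shows one way these could be computed with scikit-learn; it is an illustrative sketch under assumed inputs, not the authors' evaluation code, and `y_true`/`y_score` are hypothetical placeholder arrays.

```python
# Minimal sketch (assumed) of the reported metrics: AUC and sensitivity at a fixed specificity.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def sensitivity_at_specificity(y_true, y_score, target_specificity=0.80):
    """Return sensitivity (TPR) at the ROC operating point whose specificity is closest to the target."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    specificity = 1.0 - fpr
    idx = np.argmin(np.abs(specificity - target_specificity))
    return tpr[idx]

# Dummy example: y_true are referable-DR labels, y_score are model output probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.30, 0.70])
print("AUC:", roc_auc_score(y_true, y_score))
print("Sensitivity at 80% specificity:", sensitivity_at_specificity(y_true, y_score))
```

Statistical comparison of AUCs between models (the reported P < 0.001) would additionally require a paired test such as DeLong's, which is not shown here.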
Our findings demonstrate that ViTs outperform CNNs in detecting referable DR from retinal photographs. These results highlight the potential of ViT models to improve and optimize retinal photograph-based deep learning for referable DR detection.
Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.