College of Science, China Jiliang University, Hangzhou, Zhejiang, China.
Key Laboratory of Intelligent Manufacturing Quality Big Data Tracing and Analysis of Zhejiang Province, Hangzhou, Zhejiang, China.
PLoS One. 2024 Mar 6;19(3):e0299265. doi: 10.1371/journal.pone.0299265. eCollection 2024.
Computer-aided diagnosis systems based on deep learning algorithms have shown potential applications in the rapid diagnosis of diabetic retinopathy (DR). Because Transformers outperform convolutional neural networks (CNNs) on natural images, we attempted to develop a new model that classifies referable DR from a limited number of large retinal images using a Transformer. In this study, a Vision Transformer (ViT) with Masked Autoencoders (MAE) was applied to improve the classification performance of referable DR. We collected more than 100,000 publicly available fundus retinal images larger than 224×224 pixels and pre-trained a ViT on these images using MAE. The pre-trained ViT was then applied to classify referable DR, and its performance was compared with that of a ViT pre-trained on ImageNet. Pre-training on over 100,000 retinal images with MAE improved classification performance more than pre-training on ImageNet. The accuracy, area under the curve (AUC), highest sensitivity, and highest specificity of the proposed model are 93.42%, 0.9853, 0.973, and 0.9539, respectively. This study shows that MAE provides more flexibility in the input images and substantially reduces the number of images required. Moreover, the pre-training dataset in this study is much smaller than ImageNet, and pre-trained ImageNet weights are not required.
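Below is a minimal, self-contained sketch of the MAE-style pre-training step described in the abstract, written in plain PyTorch. The abstract does not give the exact encoder/decoder sizes, masking ratio, or data pipeline, so the model dimensions, the 75% masking ratio, and the random tensor standing in for fundus images are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative MAE pre-training sketch: patchify an image, mask most patches,
# encode only the visible ones, and reconstruct the masked pixel patches.
# Hyperparameters and model sizes are assumptions, not the paper's settings.
import torch
import torch.nn as nn


def patchify(imgs, patch=16):
    # (B, 3, 224, 224) -> (B, 196, 16*16*3) non-overlapping patches
    B, C, H, W = imgs.shape
    p = patch
    x = imgs.reshape(B, C, H // p, p, W // p, p)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)
    return x


def random_masking(x, mask_ratio=0.75):
    # Keep a random subset of patches; return kept tokens, binary mask, restore ids.
    B, N, D = x.shape
    keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=x.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_restore = ids_shuffle.argsort(dim=1)
    ids_keep = ids_shuffle[:, :keep]
    x_kept = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=x.device)
    mask[:, :keep] = 0
    mask = torch.gather(mask, 1, ids_restore)  # 1 = masked (to be reconstructed)
    return x_kept, mask, ids_restore


class TinyMAE(nn.Module):
    """Greatly shrunk stand-in for a ViT encoder plus MAE decoder (illustrative sizes)."""

    def __init__(self, patch_dim=16 * 16 * 3, dim=256, depth=4, heads=8):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.pos = nn.Parameter(torch.zeros(1, 196, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        dec_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, 2)
        self.head = nn.Linear(dim, patch_dim)  # reconstruct raw pixel patches

    def forward(self, imgs, mask_ratio=0.75):
        patches = patchify(imgs)
        tokens = self.embed(patches) + self.pos
        kept, mask, ids_restore = random_masking(tokens, mask_ratio)
        latent = self.encoder(kept)  # the encoder only sees visible patches
        # Append mask tokens and unshuffle back to the original patch order.
        B, N, D = tokens.shape
        mask_tokens = self.mask_token.expand(B, N - latent.shape[1], -1)
        full = torch.cat([latent, mask_tokens], dim=1)
        full = torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))
        pred = self.head(self.decoder(full + self.pos))
        # Reconstruction loss is computed only on the masked patches, as in MAE.
        loss = ((pred - patches) ** 2).mean(dim=-1)
        return (loss * mask).sum() / mask.sum()


if __name__ == "__main__":
    model = TinyMAE()
    fake_batch = torch.randn(2, 3, 224, 224)  # stand-in for a batch of fundus images
    loss = model(fake_batch)
    loss.backward()
    print(f"reconstruction loss: {loss.item():.4f}")
```

After pre-training in this self-supervised way, the encoder weights would be reused and a classification head fine-tuned on labeled referable-DR images; that fine-tuning step is omitted here for brevity.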