Transformer attention fusion for fine-grained medical image classification

Author information

Badar Danyal, Abbas Junaid, Alsini Raed, Abbas Tahir, ChengLiang Wang, Daud Ali

Affiliations

College of Computer Science, Chongqing University, Chongqing, China.

School of Big Data and Software Engineering, Chongqing University, Chongqing, China.

Publication information

Sci Rep. 2025 Jul 1;15(1):20655. doi: 10.1038/s41598-025-07561-x.

DOI:10.1038/s41598-025-07561-x

PMID:40596233

Abstract

Fine-grained visual classification is fundamental to medical imaging applications because it can distinguish minor lesions. Diabetic retinopathy (DR) is a preventable cause of blindness that requires accurate and timely diagnosis to prevent vision loss. Automated DR classification systems face several challenges, including irregular lesions, class imbalance, and inconsistent image quality, all of which reduce diagnostic accuracy at the early detection stage. To address these problems, we propose MSCAS-Net (Multi-Scale Cross and Self-Attention Network), which uses the Swin Transformer as its backbone. It extracts features at three resolutions (12 × 12, 24 × 24, and 48 × 48), allowing it to capture both subtle local features and global context. The model uses self-attention to strengthen spatial relationships within each scale and cross-attention to align feature patterns across scales, thereby building a comprehensive feature representation. This dual attention mechanism improves the model's ability to detect significant lesions. MSCAS-Net achieves the best performance on the APTOS, DDR, and IDRID benchmarks, with accuracies of 93.8%, 89.80%, and 86.70%, respectively. Because it learns stable features, the model handles imbalanced datasets and inconsistent image quality without requiring data augmentation. MSCAS-Net advances automated DR diagnostics by combining high diagnostic accuracy with interpretability, making it suitable as an efficient AI-powered clinical decision support system. This research demonstrates how fine-grained visual classification methods can benefit the detection and treatment of DR in its early stages.
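The combination of per-scale self-attention and cross-scale attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the fusion details, so the single-head attention without learned projections, the residual connections, the shared channel width `C`, the average pooling, and the function names `attention` and `fuse_scales` are all assumptions made for illustration. Random arrays stand in for Swin Transformer feature maps at the three stated resolutions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys_values):
    """Single-head scaled dot-product attention.

    Assumption: no learned Q/K/V projections, to keep the sketch short.
    queries: (N_q, C), keys_values: (N_kv, C) -> output: (N_q, C)
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values


def fuse_scales(feature_maps):
    """Fuse feature maps of shape (H, W, C) that share the channel width C.

    Hypothetical fusion scheme: self-attention inside each scale, then
    cross-attention from each scale to the tokens of all other scales.
    """
    # Flatten each (H, W, C) map into a sequence of H*W tokens.
    tokens = [f.reshape(-1, f.shape[-1]) for f in feature_maps]

    # Self-attention within each scale, with a residual connection.
    refined = [t + attention(t, t) for t in tokens]

    # Cross-attention: each scale queries the tokens of the other scales.
    fused = []
    for i, q in enumerate(refined):
        others = np.concatenate(
            [r for j, r in enumerate(refined) if j != i], axis=0
        )
        fused.append(q + attention(q, others))

    # Average-pool each scale and concatenate into a single descriptor.
    return np.concatenate([f.mean(axis=0) for f in fused])


rng = np.random.default_rng(0)
C = 16  # assumed shared channel width after projection
maps = [rng.standard_normal((s, s, C)) for s in (12, 24, 48)]
descriptor = fuse_scales(maps)
print(descriptor.shape)  # (48,) = 3 scales x 16 channels
```

In a full model, the descriptor would feed a classification head over the DR grades, and the attention layers would carry learned projections and multiple heads. The sketch only shows how tokens from the 12 × 12, 24 × 24, and 48 × 48 maps can attend within and across scales.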
