Chibuike Okpala, Yang Xiaopeng
Department of Human Ecology & Technology, Handong Global University, Pohang 37554, Republic of Korea.
School of Global Entrepreneurship and Information Communication Technology, Handong Global University, Pohang 37554, Republic of Korea.
Diagnostics (Basel). 2024 Dec 12;14(24):2790. doi: 10.3390/diagnostics14242790.
BACKGROUND/OBJECTIVES: Vision Transformers (ViTs) and convolutional neural networks (CNNs) have demonstrated remarkable performance in image classification, especially in medical image analysis. However, ViTs struggle to capture the high-frequency components of images, which are critical for identifying fine-grained patterns, while the local receptive fields of CNNs make it difficult to capture long-range dependencies and therefore the spatial relationships across lung regions.
METHODS: In this paper, we propose a hybrid architecture that integrates ViTs and CNNs within modular component blocks to leverage both local feature extraction and global context capture. In each component block, a CNN extracts local features, which are then passed through a ViT to capture global dependencies. We implemented a gated attention mechanism that combines channel-, spatial-, and element-wise attention to selectively emphasize important features, thereby enhancing the overall feature representation. Furthermore, we incorporated a multi-scale fusion module (MSFM) that fuses features at different scales for a more comprehensive feature representation.
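The gated attention and multi-scale fusion ideas described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give layer definitions, so the function names (`gated_attention`, `multi_scale_fusion`), the weight shapes, the pooling choices, and the way the gate blends attended and original features are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, used here for every attention weight and gate."""
    return 1.0 / (1.0 + np.exp(-z))


def gated_attention(x, w_channel, w_spatial, w_gate):
    """Illustrative gated attention over a feature map x of shape (C, H, W).

    All weights are hypothetical stand-ins for learned parameters:
      w_channel: (C, C) linear map applied to globally pooled channel statistics
      w_spatial: (2,)   mixing weights for channel-wise mean and max maps
      w_gate:    (C, C) 1x1-convolution weights for the element-wise gate
    """
    # Channel attention: one weight per channel from global average pooling.
    pooled = x.mean(axis=(1, 2))                        # (C,)
    channel_att = sigmoid(w_channel @ pooled)           # (C,)

    # Spatial attention: one weight per location from channel-pooled maps.
    spatial_att = sigmoid(
        w_spatial[0] * x.mean(axis=0) + w_spatial[1] * x.max(axis=0)
    )                                                   # (H, W)

    # Element-wise gate: a 1x1 convolution is a per-pixel linear map over channels.
    gate = sigmoid(np.einsum("oc,chw->ohw", w_gate, x))  # (C, H, W)

    attended = x * channel_att[:, None, None] * spatial_att[None, :, :]
    # Assumed blending rule: the gate mixes attended and original features.
    return gate * attended + (1.0 - gate) * x


def multi_scale_fusion(features):
    """Upsample (C, h, w) maps to the largest resolution and concatenate channels.

    Assumes square maps whose sizes divide the largest one; nearest-neighbour
    upsampling is an illustrative choice.
    """
    target = max(f.shape[1] for f in features)
    upsampled = []
    for f in features:
        k = target // f.shape[1]
        upsampled.append(np.repeat(np.repeat(f, k, axis=1), k, axis=2))
    return np.concatenate(upsampled, axis=0)


rng = np.random.default_rng(0)
feats = [rng.normal(size=(4, s, s)) for s in (8, 4, 2)]
weights = (rng.normal(size=(4, 4)), rng.normal(size=2), rng.normal(size=(4, 4)))
gated = [gated_attention(f, *weights) for f in feats]
fused = multi_scale_fusion(gated)
print(fused.shape)  # (12, 8, 8)
```

In a trained model the three weight arrays would be learned, and the fused map would feed a classification head; here they are random so the sketch only demonstrates shapes and data flow.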
RESULTS: The proposed model achieved an accuracy of 99.50% in classifying four pulmonary conditions.
CONCLUSIONS: Extensive experiments and ablation studies demonstrated that our approach improves medical image classification performance while achieving good calibration. This hybrid approach offers a promising framework for reliable and accurate disease diagnosis in medical imaging.
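The abstract does not state which calibration metric was used. As one common possibility, the sketch below computes the expected calibration error (ECE), which bins predictions by confidence and averages the gap between accuracy and mean confidence in each bin. The function name, the equal-width binning, and the default of 10 bins are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (an assumed, commonly used variant).

    confidences: top-class predicted probabilities, each in (0, 1]
    correct:     1 if the corresponding prediction was right, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)

    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # Gap between empirical accuracy and mean confidence in this bin,
            # weighted by the fraction of samples that fall into the bin.
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece


# Four predictions at 90% confidence, three of them correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A value near zero means that predicted confidence tracks observed accuracy; in the toy call above, accuracy is 75% against 90% confidence, so the model would be overconfident.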