Kim Ji Woong, Khan Aisha Urooj, Banerjee Imon
School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, USA.
Department of Radiology, Mayo Clinic, Phoenix, AZ, USA.
J Imaging Inform Med. 2025 Jan 27. doi: 10.1007/s10278-024-01322-4.
Vision transformers (ViTs) and convolutional neural networks (CNNs) each possess distinct strengths in medical imaging: ViTs excel at capturing long-range dependencies through self-attention, while CNNs are adept at extracting local features via spatial convolution filters. ViTs may struggle to capture the detailed local spatial information that is critical for tasks such as anomaly detection in medical imaging, while shallow CNNs often fail to effectively abstract global context. This study aims to explore and evaluate hybrid architectures that integrate ViT and CNN components to leverage their complementary strengths for enhanced performance in medical vision tasks such as segmentation, classification, reconstruction, and prediction. Following the PRISMA guidelines, a systematic review was conducted of 34 articles published between 2020 and September 2024 that proposed novel hybrid ViT-CNN architectures specifically for medical imaging tasks in radiology. The review focused on analyzing architectural variations, strategies for merging ViT and CNN components, innovative applications of ViT, and efficiency metrics, including parameter count, computational cost (GFLOPs), and performance benchmarks. The review found that integrating ViT and CNN can mitigate the limitations of each architecture, offering comprehensive solutions that combine global context understanding with precise local feature extraction. We benchmarked the articles on these criteria and derived a ranked list. By synthesizing the current literature, this review defines fundamental concepts of hybrid vision transformers and highlights emerging trends in the field. It provides a clear direction for future research aimed at optimizing the integration of ViT and CNN for effective use in medical imaging, contributing to advances in diagnostic accuracy and image analysis.
We performed a systematic review of hybrid vision transformer architectures following the PRISMA guidelines and conducted a thorough comparative analysis to benchmark the architectures.
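As a toy illustration of the merge strategy this class of hybrids is built on — not any specific reviewed paper's architecture — the following minimal NumPy sketch chains a convolutional stage (local feature extraction, the CNN strength) with a single self-attention step over patch tokens (long-range mixing, the ViT strength). For brevity it uses a single channel, one kernel, and identity Q/K/V projections; all function and variable names are illustrative.

```python
import numpy as np

def conv2d(x, k):
    # Valid convolution of a single-channel image x with kernel k:
    # local feature extraction, the CNN half of the hybrid.
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def self_attention(tokens):
    # Single-head self-attention over token rows: every token attends
    # to every other token, the global-context half of the hybrid.
    # Identity Q/K/V projections keep the sketch minimal.
    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)      # (N, N) similarities
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # row-wise softmax
    return attn @ tokens                         # attention-weighted mix

def hybrid_forward(image, kernel, patch=2):
    # 1) CNN stage: local feature map
    fmap = conv2d(image, kernel)
    # 2) Tokenize: non-overlapping patches flattened into token vectors
    H, W = fmap.shape
    Hc, Wc = H - H % patch, W - W % patch
    tokens = (fmap[:Hc, :Wc]
              .reshape(Hc // patch, patch, Wc // patch, patch)
              .transpose(0, 2, 1, 3)
              .reshape(-1, patch * patch))
    # 3) ViT stage: global mixing via self-attention
    return self_attention(tokens)

rng = np.random.default_rng(0)
out = hybrid_forward(rng.standard_normal((9, 9)), rng.standard_normal((3, 3)))
print(out.shape)  # 9 patch tokens of dimension 4 after global mixing
```

Real hybrids surveyed in the review differ mainly in where this hand-off happens (early, late, or interleaved) and in how the two streams' features are fused.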