Liu Yang, Zhang Yao, Wang Yixin, Hou Feng, Yuan Jin, Tian Jiang, Zhang Yang, Shi Zhongchao, Fan Jianping, He Zhiqiang
IEEE Trans Neural Netw Learn Syst. 2024 Jun;35(6):7478-7498. doi: 10.1109/TNNLS.2022.3227717. Epub 2024 Jun 3.
Transformer, an attention-based encoder-decoder model, has already revolutionized the field of natural language processing (NLP). Inspired by such significant achievements, some pioneering works have recently employed Transformer-like architectures in the computer vision (CV) field, demonstrating their effectiveness on three fundamental CV tasks (classification, detection, and segmentation) as well as multiple sensory data streams (images, point clouds, and vision-language data). Owing to their competitive modeling capabilities, visual Transformers have achieved impressive performance improvements over multiple benchmarks compared with modern convolutional neural networks (CNNs). In this survey, we comprehensively review more than 100 different visual Transformers according to the three fundamental CV tasks and different data stream types, and propose a taxonomy that organizes the representative methods by their motivations, structures, and application scenarios. Because of their differences in training settings and dedicated vision tasks, we also evaluate and compare all these existing visual Transformers under different configurations. Furthermore, we reveal a series of essential but underexplored aspects that may enable visual Transformers to stand out from numerous architectures, e.g., slack high-level semantic embeddings that bridge the gap between visual Transformers and sequential ones. Finally, two promising research directions are suggested for future investigation. We will continue to update the latest articles and their released source code at https://github.com/liuyang-ict/awesome-visual-transformers.
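Since the abstract centers on attention-based encoder-decoder models applied to vision, the core computation they share can be sketched as follows. This is a minimal NumPy illustration, not code from the survey: it splits an image into flat patches (ViT-style tokenization) and runs one round of scaled dot-product self-attention over the resulting tokens. All names, shapes, and the random projection are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V, the attention core of a Transformer layer."""
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)      # (batch, seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ v                                  # (batch, seq, d)

rng = np.random.default_rng(0)

# ViT-style tokenization: an 8x8 RGB image cut into 4x4 patches -> 4 tokens.
H = W = 8; P = 4; C = 3
img = rng.standard_normal((H, W, C))
patches = (img.reshape(H // P, P, W // P, P, C)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, P * P * C))                  # (4, 48) flat patches

# Linear projection of flattened patches into a d=16 embedding space.
W_proj = rng.standard_normal((P * P * C, 16))
tokens = (patches @ W_proj)[None]                       # (1, 4, 16)

out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)                                        # (1, 4, 16)
```

Each output token is a convex combination of the value tokens, which is why attention preserves a constant value sequence exactly; a real visual Transformer would add positional embeddings, multiple heads, and feed-forward sublayers on top of this kernel.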