Moutik Oumaima, Sekkat Hiba, Tigani Smail, Chehri Abdellah, Saadane Rachid, Tchakoucht Taha Ait, Paul Anand
Engineering Unit, Euromed Research Center, Euro-Mediterranean University, Fes 30030, Morocco.
Department of Mathematics and Computer Science, Royal Military College of Canada, Kingston, ON K7K 7B4, Canada.
Sensors (Basel). 2023 Jan 9;23(2):734. doi: 10.3390/s23020734.
Understanding actions in videos remains a significant challenge in computer vision and has been the subject of extensive research over the past decades. Convolutional neural networks (CNNs) are a central component of this topic and have played a crucial role in the success of deep learning. Inspired by the human visual system, CNNs have been applied to visual data and have addressed a variety of challenges across computer vision tasks and video/image analysis, including action recognition (AR). Recently, however, following the success of the Transformer in natural language processing (NLP), it has begun to set new trends in vision tasks, sparking a debate over whether Vision Transformer (ViT) models will replace CNNs for action recognition in video clips. This paper examines this trending topic in detail: it studies CNNs and Transformers for action recognition separately and presents a comparative study of the accuracy-complexity trade-off. Finally, based on the outcome of the performance analysis, the question of whether CNNs or Vision Transformers will win the race is discussed.