Moutik Oumaima, Sekkat Hiba, Tigani Smail, Chehri Abdellah, Saadane Rachid, Tchakoucht Taha Ait, Paul Anand
Engineering Unit, Euromed Research Center, Euro-Mediterranean University, Fes 30030, Morocco.
Department of Mathematics and Computer Science, Royal Military College of Canada, Kingston, ON K7K 7B4, Canada.
Sensors (Basel). 2023 Jan 9;23(2):734. doi: 10.3390/s23020734.
Understanding actions in videos remains a significant challenge in computer vision and has been the subject of extensive research over the past decades. Convolutional neural networks (CNNs) are a central component of this topic and have played a crucial role in the success of deep learning. Inspired by the human visual system, CNNs have been applied to visual data and have addressed a variety of challenges across computer vision tasks and video/image analysis, including action recognition (AR). Recently, however, following the success of the Transformer in natural language processing (NLP), it has begun to set new trends in vision tasks, sparking a debate over whether Vision Transformer (ViT) models will replace CNNs for action recognition in video clips. This paper examines this trending topic in detail: it studies CNNs and Transformers for action recognition separately and presents a comparative study of the accuracy-complexity trade-off. Finally, based on the outcome of the performance analysis, the question of whether CNNs or Vision Transformers will win the race is discussed.