Yadav Santosh Kumar, Luthra Achleshwar, Pahwa Esha, Tiwari Kamlesh, Rathore Heena, Pandey Hari Mohan, Corcoran Peter
College of Science and Engineering, National University of Ireland, Galway, H91TK33, Ireland; CogniX, Quadrant-2, 10th Floor, Cyber Towers, Madhapur, Hyderabad, Telangana 500081, India.
Department of CSIS, Birla Institute of Technology and Science Pilani, Pilani Campus, Rajasthan 333031, India.
Neural Netw. 2023 Feb;159:57-69. doi: 10.1016/j.neunet.2022.12.005. Epub 2022 Dec 13.
Human activity recognition (HAR) using drone-mounted cameras has attracted considerable interest from the computer vision research community in recent years. A robust and efficient HAR system plays a pivotal role in fields such as video surveillance, crowd behavior analysis, sports analysis, and human-computer interaction. The task is made challenging by complex poses, varying viewpoints, and the environmental scenarios in which the action takes place. To address these complexities, in this paper we propose a novel Sparse Weighted Temporal Attention (SWTA) module that uses sparsely sampled video frames to obtain global weighted temporal attention. The proposed SWTA comprises two parts: first, a temporal segment network that sparsely samples a given set of frames; second, weighted temporal attention, which fuses attention maps derived from optical flow with raw RGB images. This is followed by a basenet network, consisting of a convolutional neural network (CNN) module and fully connected layers, which performs the activity recognition. SWTA can be used as a plug-in module with existing deep CNN architectures, enabling them to learn temporal information while eliminating the need for a separate temporal stream. It has been evaluated on three publicly available benchmark datasets, namely Okutama, MOD20, and Drone-Action. The proposed model achieves accuracies of 72.76%, 92.56%, and 78.86% on the respective datasets, surpassing the previous state-of-the-art performances by margins of 25.26%, 18.56%, and 2.94%, respectively.
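The two stages described in the abstract — segment-wise sparse frame sampling and a weighted fusion of flow-derived attention maps with RGB frames — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names, the per-frame softmax weighting, and the toy tensor shapes are all assumptions.

```python
import numpy as np

def sparse_sample(num_frames, num_segments, rng=None):
    """Temporal-segment-style sparse sampling: split the video into
    equal segments and draw one frame index from each (hypothetical helper)."""
    if rng is None:
        rng = np.random.default_rng(0)
    seg_len = num_frames // num_segments
    return np.array([s * seg_len + rng.integers(seg_len)
                     for s in range(num_segments)])

def weighted_temporal_attention(rgb, flow_attn):
    """Fuse optical-flow-derived attention maps with RGB frames using
    softmax temporal weights (a sketch of one plausible formulation)."""
    # rgb: (T, H, W, 3); flow_attn: (T, H, W) attention maps from optical flow
    w = np.exp(flow_attn.mean(axis=(1, 2)))
    w = w / w.sum()                        # softmax over the T sampled frames
    attn = flow_attn[..., None]            # broadcast attention over channels
    # weighted sum over time yields one attended feature map, (H, W, 3)
    return (w[:, None, None, None] * attn * rgb).sum(axis=0)

# Toy usage: sample 8 frames from a 120-frame clip, then fuse.
idx = sparse_sample(num_frames=120, num_segments=8)
rgb = np.random.rand(8, 16, 16, 3)
flow_attn = np.random.rand(8, 16, 16)
fused = weighted_temporal_attention(rgb, flow_attn)
```

The fused map would then be passed to the basenet CNN; one index per segment keeps coverage of the whole clip while reading only a small fraction of its frames.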