Shah Mubashir, Nawaz Tahir, Nawaz Rab, Rashid Nasir, Ali Muhammad Osama
Department of Mechatronics Engineering, College of Electrical and Mechanical Engineering, National University of Sciences and Technology, Islamabad, Pakistan.
Deep Learning Lab, School of Interdisciplinary Engineering and Science, National University of Sciences and Technology, Islamabad, Pakistan.
PLoS One. 2025 May 14;20(5):e0323314. doi: 10.1371/journal.pone.0323314. eCollection 2025.
Human action recognition forms an important part of several aerial security and surveillance applications. Indeed, numerous efforts have been made to solve the problem effectively and efficiently. Existing methods, however, are generally aimed at recognizing either solo actions or interactions, restricting their use to specific scenarios. Additionally, the need remains for lightweight, computationally efficient models that can be deployed in real-world applications. To this end, this paper presents a generic, lightweight, and computationally efficient Transformer-based model, referred to as InterAcT, which relies on bodily keypoints extracted with YOLO v8 to recognize both solo human actions and interactions in aerial videos. It features a lightweight architecture with 0.0709M parameters and 0.0389 GFLOPs, distinguishing it from the AcT models. An extensive performance evaluation has been carried out on two publicly available datasets, Drone Action and UT-Interaction, comprising a total of 18 classes covering both solo actions and interactions. The model is trained on an 80% training split, optimized on a 10% validation split, and evaluated on the remaining 10% test split, achieving highly encouraging performance on multiple benchmarks and outperforming several state-of-the-art methods. With an accuracy of 0.9923, our model outperforms the AcT models (micro: 0.9353, small: 0.9893, base: 0.9907, large: 0.9558), 2P-GCN (0.9337), LSTM (0.9774), 3D-ResNet (0.9921), and 3D CNN (0.9920). It can recognize a large number of solo-action and two-person interaction classes both in aerial videos and in footage from ground-level cameras (grayscale and RGB).
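The keypoint-extraction front end described above can be sketched as follows. This is a minimal illustration, assuming the Ultralytics YOLOv8 pose API and the COCO 17-keypoint layout; the nano checkpoint name, the two-person cap, and the zero-padding scheme are assumptions made for the sketch, not details reported in the paper.

import numpy as np
from ultralytics import YOLO

pose_model = YOLO("yolov8n-pose.pt")  # assumed pose checkpoint; the paper does not name one

def extract_keypoints(frame, max_people=2):
    # Return a (max_people, 17, 2) array of (x, y) joint coordinates for one frame,
    # zero-padded when fewer than max_people persons are detected.
    result = pose_model(frame, verbose=False)[0]
    kpts = np.zeros((max_people, 17, 2), dtype=np.float32)
    if result.keypoints is not None:
        xy = result.keypoints.xy.cpu().numpy()  # (num_detected, 17, 2)
        n = min(len(xy), max_people)
        kpts[:n] = xy[:n]
    return kpts

Stacking these per-frame arrays over a clip yields the keypoint sequence that the Transformer then classifies.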
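For the classifier itself, a generic keypoint-sequence Transformer in PyTorch might look like the sketch below. This is a stand-in, not the published InterAcT architecture: the abstract reports only the budget (0.0709M parameters, 0.0389 GFLOPs) and the 18-class output, so the sequence length, embedding width, head count, and layer count here are illustrative guesses.

import torch
import torch.nn as nn

class KeypointTransformer(nn.Module):
    # Toy Transformer over flattened two-person keypoint sequences (hypothetical sizes).
    def __init__(self, n_frames=30, in_dim=2 * 17 * 2, d_model=32,
                 n_heads=2, n_layers=2, n_classes=18):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)                      # per-frame keypoint embedding
        self.pos = nn.Parameter(torch.zeros(1, n_frames, d_model))   # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=64, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                   # x: (batch, n_frames, in_dim)
        h = self.encoder(self.embed(x) + self.pos)
        return self.head(h.mean(dim=1))     # temporal average pooling -> class logits

model = KeypointTransformer()
print(sum(p.numel() for p in model.parameters()))  # rough size check against the 0.07M budget

With these toy sizes the model lands in the tens of thousands of parameters, the same order of magnitude as the 0.0709M the paper reports; matching the exact budget would require the published hyperparameters.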