Qi Baolong, Wu Baoyuan, Sun Bailing
Office of Security Department, Chengdu Sport University, Chengdu, 610041, China.
School of Wushu, Chengdu Sport University, Chengdu, 610041, China.
Sci Rep. 2025 Aug 12;15(1):29497. doi: 10.1038/s41598-025-12531-4.
Fistfight detection in video data is a critical task in video surveillance systems, where identifying physical altercations in real time can enhance safety and security in public spaces. Earlier techniques primarily emphasized capturing inter-person interactions and combining individual characteristics into group-based representations, often overlooking the critical intra-person dynamics within the human bodypose-point framework. However, essential individual features can be extracted by examining the progression and temporal patterns of human skeletal movements. This paper presents a novel multimodal spatio-temporal fistfight detection model (MSTFDet) that integrates RGB images and human skeletal data to accurately identify violent behaviors. The proposed framework leverages both a Context-Aware Encoded Transformer (CAET) for modeling interactions between individuals and their environment and a Spatial-Temporal Graph Convolutional Network (ST-GCN) for capturing intra-person and inter-person dynamics from skeletal data. The RGB module uses a combination of spatial and temporal transformers to model contextual relationships and individual actions, while the bodypose-point module processes skeletal data to capture the fine-grained motion of individuals. We conduct evaluations on two public datasets that feature complex real-world scenarios: the Surveillance Camera Fight Dataset (SCFD) and the RWF-2000 dataset. MSTFDet achieved a multi-class classification accuracy (MCA) of 92.3% on SCFD and 95.2% on RWF-2000. These results highlight the effectiveness of the proposed approach in capturing both spatial and temporal features, providing a robust solution for real-time fistfight detection in diverse and challenging environments.
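The abstract describes a two-branch design: transformer encoders over RGB features capture context and individual actions, while graph convolutions over skeletal joints capture fine-grained motion, with the two streams fused for classification. Since the paper's exact CAET and ST-GCN configurations are not reproduced here, the following is only a minimal PyTorch sketch of that general two-stream fusion pattern; the feature dimensions (2048-d frame features, 17 joints), the identity-matrix adjacency placeholder, and the concatenation-based late fusion are all assumptions, not the authors' implementation.

    # Minimal two-branch fusion sketch in the spirit of MSTFDet.
    # All layer sizes and the fusion strategy are assumptions; the paper's
    # actual CAET and ST-GCN configurations are not specified here.
    import torch
    import torch.nn as nn

    class STGCNBlock(nn.Module):
        """Simplified spatial-temporal graph conv: joint-graph conv + temporal conv."""
        def __init__(self, in_ch, out_ch, adjacency):
            super().__init__()
            # Joint adjacency (V x V); fixed here, learnable in many ST-GCN variants.
            self.register_buffer("A", adjacency)
            self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # per-joint projection
            self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(9, 1), padding=(4, 0))
            self.relu = nn.ReLU()

        def forward(self, x):                              # x: (N, C, T, V)
            x = torch.einsum("nctv,vw->nctw", x, self.A)   # aggregate over neighboring joints
            x = self.relu(self.spatial(x))
            return self.relu(self.temporal(x))             # convolve along the time axis

    class FusionFightDetector(nn.Module):
        def __init__(self, num_joints=17, d_model=256, num_classes=2):
            super().__init__()
            A = torch.eye(num_joints)              # placeholder adjacency (self-loops only)
            self.skeleton = STGCNBlock(3, d_model, A)       # (x, y, confidence) per joint
            enc = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.rgb_temporal = nn.TransformerEncoder(enc, num_layers=2)
            self.rgb_proj = nn.Linear(2048, d_model)        # assumes 2048-d CNN frame features
            self.head = nn.Linear(2 * d_model, num_classes) # late fusion by concatenation

        def forward(self, rgb_feats, skel):
            # rgb_feats: (N, T, 2048) pre-extracted frame features; skel: (N, 3, T, V)
            r = self.rgb_temporal(self.rgb_proj(rgb_feats)).mean(dim=1)  # pool over time
            s = self.skeleton(skel).mean(dim=(2, 3))        # pool over time and joints
            return self.head(torch.cat([r, s], dim=1))

    model = FusionFightDetector()
    logits = model(torch.randn(2, 16, 2048), torch.randn(2, 3, 16, 17))
    print(logits.shape)  # torch.Size([2, 2])

Concatenation is used here only as the simplest fusion baseline; the paper's transformer-based context encoding would replace the mean-pooling and the fixed adjacency with learned attention and a skeleton-topology graph.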