Department of Computer Science, Universidad Católica San Pablo, Arequipa 04001, Peru.
Department of Computer Science, Federal University of Ouro Preto, Ouro Preto 35400-000, Brazil.
Sensors (Basel). 2022 Jun 14;22(12):4502. doi: 10.3390/s22124502.
Automatic violence detection in video surveillance is essential for social and personal security. Monitoring the large number of surveillance cameras used in public and private areas is challenging for human operators. The manual nature of this task significantly increases the possibility of ignoring important events, owing to human limitations when paying attention to multiple targets at a time. Researchers have proposed several methods to detect violent events automatically to overcome this problem. So far, most previous studies have focused only on classifying short clips without performing spatial localization. In this work, we tackle this problem by proposing a weakly supervised method to detect spatially and temporally violent actions in surveillance videos using only video-level labels. The proposed method follows a Fast R-CNN-style architecture that has been temporally extended. First, we generate spatiotemporal proposals (action tubes) by leveraging pre-trained person detectors, motion appearance (dynamic images), and tracking algorithms. Then, given an input video and the action proposals, we extract spatiotemporal features using deep neural networks. Finally, a classifier based on multiple-instance learning is trained to label each action tube as violent or non-violent. We obtain results comparable to the state of the art on three public databases, Hockey Fight, RLVSD, and RWF-2000, achieving accuracies of 97.3%, 92.88%, and 88.7%, respectively.
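The multiple-instance learning step in the abstract can be illustrated with a minimal sketch. Under the standard MIL assumption, a video (the bag) is labeled violent if at least one of its action tubes (the instances) scores as violent; only the video-level label is supervised. The function names, the toy linear scorer, and the threshold below are illustrative assumptions, not the paper's implementation:

```python
# Hedged sketch of the MIL labeling rule: a video is violent iff at least
# one action tube scores above a threshold. Features, weights, and the
# scorer are toy placeholders, not the paper's deep-network features.

def score_tubes(tube_features, weights):
    """Toy linear scorer: one violence score per action tube."""
    return [sum(w * f for w, f in zip(weights, feats)) for feats in tube_features]

def video_label(tube_scores, threshold=0.5):
    """MIL max-pooling: the bag label comes from the highest-scoring instance."""
    return max(tube_scores) >= threshold

# Example: three tubes from one clip, 2-D toy features.
tubes = [[0.1, 0.2], [0.9, 0.8], [0.3, 0.1]]
weights = [0.5, 0.5]
scores = score_tubes(tubes, weights)
print(video_label(scores))  # the second tube scores 0.85, so the video is flagged
```

Max-pooling over instance scores is the simplest MIL aggregation; it lets video-level supervision flow back to the single most suspicious tube, which is what makes spatial localization possible with only video-level labels.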