Md Haidar Sharif, Lei Jiao, Christian W Omlin
Department of ICT, University of Agder, 4630 Kristiansand, Norway.
Sensors (Basel). 2023 Sep 7;23(18):7734. doi: 10.3390/s23187734.
Video anomaly event detection (VAED) is one of the key technologies in computer vision for smart surveillance systems. With the advent of deep learning, contemporary advances in VAED have achieved substantial success. Recently, weakly supervised VAED (WVAED) has become a popular line of VAED research. WVAED methods do not depend on a supplementary self-supervised surrogate task, yet they can estimate anomaly scores directly. However, the performance of WVAED methods depends on pretrained feature extractors. In this paper, we first exploit two types of pretrained feature extractors, CNN-based (e.g., C3D and I3D) and ViT-based (e.g., CLIP), to effectively extract discriminative representations. We then consider long-range and short-range temporal dependencies and identify video snippets of interest by leveraging our proposed temporal self-attention network (TSAN). We design a multiple instance learning (MIL)-based generalized architecture named CNN-ViT-TSAN, which uses CNN- and/or ViT-extracted features together with TSAN to specify a family of models for the WVAED problem. Experimental results on publicly available popular crowd datasets demonstrate the effectiveness of our CNN-ViT-TSAN.
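To make the MIL setting concrete: in weakly supervised VAED, each video is a "bag" of snippets with only a video-level label, and a common training signal (the classic ranking formulation; the paper's CNN-ViT-TSAN builds on MIL, and its exact loss may differ) is a hinge that pushes the top-scoring snippet of an anomalous bag above the top-scoring snippet of a normal bag. A minimal sketch, with the `margin` value and toy scores as illustrative assumptions:

```python
def mil_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Classic MIL ranking hinge for weakly supervised anomaly detection:
    the highest anomaly score in an anomalous (positive) bag should exceed
    the highest score in a normal (negative) bag by at least `margin`.
    Only bag-level labels are needed; snippet labels stay unknown."""
    return max(0.0, margin - max(pos_scores) + max(neg_scores))

# Toy snippet-level anomaly scores (hypothetical values, one bag pair):
pos = [0.1, 0.9, 0.3]  # snippets from a video labeled anomalous
neg = [0.2, 0.4, 0.1]  # snippets from a video labeled normal
loss = mil_ranking_loss(pos, neg)  # 1.0 - 0.9 + 0.4 = 0.5
```

In practice the snippet scores would come from a scoring head over the CNN/ViT features (after temporal modeling such as TSAN), and the loss is averaged over many positive/negative bag pairs per batch.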