Pu Yujiang, Wu Xiaoyu, Yang Lulu, Wang Shengjin
IEEE Trans Image Process. 2024;33:4923-4936. doi: 10.1109/TIP.2024.3451935. Epub 2024 Sep 11.
Weakly supervised video anomaly detection aims to locate abnormal activities in untrimmed videos without the need for frame-level supervision. Prior work has utilized graph convolution networks or self-attention mechanisms alongside multiple instance learning (MIL)-based classification loss to model temporal relations and learn discriminative features. However, these approaches are limited in two aspects: 1) Multi-branch parallel architectures, while capturing multi-scale temporal dependencies, inevitably lead to increased parameter and computational costs. 2) The binarized MIL constraint only ensures inter-class separability while neglecting the fine-grained discriminability within anomalous classes. To this end, we introduce a novel WS-VAD framework that focuses on efficient temporal modeling and intra-class discriminability among anomalies. We first construct a Temporal Context Aggregation (TCA) module that simultaneously captures local-global dependencies by reusing an attention matrix along with adaptive context fusion. In addition, we propose a Prompt-Enhanced Learning (PEL) module that incorporates semantic priors using knowledge-based prompts to boost the discrimination of visual features while ensuring separability across anomaly subclasses. The proposed components have been validated through extensive experiments, which demonstrate superior performance on three challenging datasets, UCF-Crime, XD-Violence and ShanghaiTech, with fewer parameters and reduced computational effort. Notably, our method can significantly improve the detection accuracy for certain anomaly subclasses and reduce the false alarm rate. Our code is available at: https://github.com/yujiangpu20/PEL4VAD.
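The abstract's key efficiency claim is that one attention matrix can serve both a local and a global temporal branch. The minimal sketch below illustrates that idea only; it is not the paper's actual TCA module. The function name `tca_sketch`, the banded mask `window`, and the fixed fusion weight `alpha` are illustrative assumptions (the paper describes adaptive context fusion, and its exact formulation is in the full text and the linked repository).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tca_sketch(feats, window=3, alpha=0.5):
    """Illustrative sketch of attention-matrix reuse (NOT the paper's TCA).

    feats:  (T, D) array of per-snippet video features.
    window: half-width of the banded mask for the local branch (assumed).
    alpha:  fixed fusion weight; the paper learns this adaptively (assumed).
    """
    T, D = feats.shape
    sim = feats @ feats.T / np.sqrt(D)              # one T x T similarity matrix
    global_attn = softmax(sim, axis=-1)             # global branch uses it directly
    idx = np.arange(T)
    local_mask = np.abs(idx[:, None] - idx[None, :]) <= window
    local_sim = np.where(local_mask, sim, -np.inf)  # local branch reuses `sim`,
    local_attn = softmax(local_sim, axis=-1)        # masked to a temporal band
    # Fuse local and global context (fixed weight here, adaptive in the paper).
    return alpha * (local_attn @ feats) + (1 - alpha) * (global_attn @ feats)
```

Because both branches share the single similarity matrix `sim`, the local branch adds only a masking and re-normalization step rather than a second attention computation, which is consistent with the abstract's claim of fewer parameters and reduced computational effort compared to multi-branch parallel architectures.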