Liu Tianshan, Lam Kin-Man, Kong Jun
IEEE Trans Neural Netw Learn Syst. 2024 Sep;35(9):12627-12641. doi: 10.1109/TNNLS.2023.3263966. Epub 2024 Sep 3.
Weakly supervised video anomaly detection (WS-VAD) aims to identify the snippets involving anomalous events in long untrimmed videos, with solely video-level binary labels. A typical paradigm among existing WS-VAD methods is to employ multiple modalities as inputs, e.g., RGB, optical flow, and audio, as they can provide sufficient discriminative clues that are robust to the diverse, complicated real-world scenes. However, such a pipeline relies heavily on the availability of multiple modalities and is computationally expensive and storage-demanding when processing long sequences, which limits its use in some applications. To address this dilemma, we propose a privileged knowledge distillation (KD) framework dedicated to the WS-VAD task, which maintains the benefits of exploiting additional modalities while avoiding the need for multimodal data in the inference phase. We argue that the performance of the privileged KD framework mainly depends on two factors: 1) the effectiveness of the multimodal teacher network and 2) the completeness of the useful information transfer. To obtain a reliable teacher network, we propose a cross-modal interactive learning strategy and an anomaly-normal discrimination loss, which target learning task-specific cross-modal features and encourage the separability of anomalous and normal representations, respectively. Furthermore, we design both representation- and logits-level distillation loss functions, which force the unimodal student network to distill abundant privileged knowledge from the well-trained multimodal teacher network, in a snippet-to-video fashion. Extensive experimental results on three public benchmarks demonstrate that the proposed privileged KD framework can train a lightweight yet effective detector for localizing anomalous events under the supervision of video-level annotations.
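The two-level distillation described above (representation-level and logits-level) can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name, the MSE/KL choice of losses, and the weighting scheme are assumptions; the paper's snippet-to-video transfer and teacher training objectives are more involved.

```python
import torch
import torch.nn.functional as F

def privileged_kd_loss(student_feat, teacher_feat,
                       student_logits, teacher_logits,
                       temperature=2.0, alpha=0.5):
    """Hypothetical sketch of a two-level privileged distillation loss:
    representation-level MSE plus temperature-scaled KL on logits.
    All tensors have shape (batch, num_snippets, ...)."""
    # Representation level: pull the unimodal student's snippet features
    # toward the multimodal teacher's features.
    rep_loss = F.mse_loss(student_feat, teacher_feat)

    # Logits level: soften both anomaly-score distributions with a
    # temperature and match them via KL divergence (standard KD recipe).
    t = temperature
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    # Weighted combination of the two transfer terms.
    return alpha * rep_loss + (1 - alpha) * kd_loss
```

At inference time only the unimodal student is executed, which is where the computational and storage savings over the multimodal pipeline come from.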