Wang Yuanyuan, Zhao Yijie, Huo Yanhua, Lu Yiping
Library, Hebei North University, Zhangjiakou, 075000, Hebei, China.
Educational Technology and Information Center, Hebei North University, Zhangjiakou, 075000, Hebei, China.
Sci Rep. 2025 May 10;15(1):16291. doi: 10.1038/s41598-025-01146-4.
Due to complex environmental conditions and varying noise levels, traditional models are of limited effectiveness for detecting anomalies in video sequences. To address the accuracy, robustness, and real-time processing requirements of image and video processing, this study proposes a deep-learning-based anomaly detection and recognition algorithm for video image data. The algorithm combines novel spatio-temporal feature extraction with noise suppression and introduces an improved variational autoencoder (VAE) structure to improve processing performance, especially in complex environments. The proposed model, the Spatio-Temporal Anomaly Detection Network (STADNet), captures the spatio-temporal features of video images through a multi-scale three-dimensional (3D) convolution module and a spatio-temporal attention mechanism, improving the accuracy of anomaly detection. A multi-stream network architecture and a cross-attention fusion mechanism are also adopted to jointly account for color, texture, and motion cues, further improving the robustness and generalization ability of the model. Experimental results show that, compared with existing models, the new model has clear advantages in performance stability and real-time processing under different noise levels. Specifically, the proposed model achieves an AUC of 0.95 on the UCSD Ped2 dataset, about 10% higher than other models, and an AUC of 0.93 on the Avenue dataset, about 12% higher. This study not only proposes an effective image and video processing scheme but also demonstrates broad practical potential, providing a new perspective and methodological basis for future research and applications in related fields.
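The abstract does not specify implementation details for STADNet's spatio-temporal attention. As a rough illustration of the general idea only, the following NumPy sketch computes softmax attention weights jointly over the temporal and spatial positions of a video feature volume and re-weights the features accordingly. All names, shapes, and the scoring rule are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def spatio_temporal_attention(feats: np.ndarray) -> np.ndarray:
    """Re-weight a video feature volume of shape (T, H, W, C) by a
    softmax over all T*H*W spatio-temporal positions.

    Illustrative sketch only; the actual STADNet module is not
    described in the abstract.
    """
    T, H, W, C = feats.shape
    # Score each spatio-temporal position by its mean channel activation.
    scores = feats.mean(axis=-1).reshape(-1)           # shape (T*H*W,)
    scores = scores - scores.max()                     # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax, sums to 1
    weights = weights.reshape(T, H, W, 1)
    # Scale by the number of positions so that uniform attention
    # leaves the features unchanged.
    return feats * weights * (T * H * W)

# Usage: a toy 8-frame, 4x4, 16-channel feature volume.
x = np.random.rand(8, 4, 4, 16).astype(np.float32)
y = spatio_temporal_attention(x)
assert y.shape == x.shape
```

Under this scaling convention, positions with above-average activation are amplified and the rest are suppressed, which is the qualitative effect an attention mechanism is meant to have on salient (potentially anomalous) regions.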