Institute of Intelligent Manufacturing, Guangdong Academy of Sciences, Guangdong Key Laboratory of Modern Control Technology, Guangzhou, China.
School of Physics and Technology, Wuhan University, China.
Neural Netw. 2021 Jan;133:229-239. doi: 10.1016/j.neunet.2020.10.003. Epub 2020 Nov 11.
Videos are widely used as a medium through which people perceive physical changes in the world. However, the audio track typically mixes the sounds of multiple sounding objects, making it difficult to distinguish and localize each sound as a separate entity in the video. To address this problem, this paper proposes the Deep Multi-Modal Attention Network (DMMAN), which models unconstrained video datasets to perform sound source separation and event localization. Built on a multi-modal separator module and a multi-modal matching classifier module, the model addresses the sound separation and modal synchronization problems through a two-stage fusion of audio and visual features. To link the two modules, regression and classification losses are combined to form the DMMAN loss function. The spectrum masks and attention synchronization scores estimated by the DMMAN generalize readily to sound source and event localization tasks. Quantitative experiments show that the DMMAN not only separates high-quality sound sources, as measured by Signal-to-Distortion Ratio (SDR) and Signal-to-Interference Ratio (SIR), but also handles mixed sound scenes whose sources were never heard together during training. The DMMAN also achieves higher classification accuracy than the contrast baselines on event localization tasks.
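As a minimal illustration of two ideas the abstract names, the sketch below shows (1) separating a source by applying an estimated spectral mask element-wise to a mixture spectrogram, and (2) combining a regression loss on the masks with a classification loss on audio-visual matching scores. The function names, shapes, and the simple L1/cross-entropy loss forms are illustrative assumptions, not the paper's actual architecture or training objective.

```python
import numpy as np

def separate_with_mask(mixture_spec, mask):
    """Recover one source by element-wise masking of the mixture
    magnitude spectrogram (values in mask are typically in [0, 1])."""
    return mixture_spec * mask

def combined_loss(pred_mask, true_mask, match_logits, match_label, alpha=1.0):
    """Hypothetical joint objective: L1 regression on the estimated mask
    plus softmax cross-entropy on audio-visual matching logits,
    weighted by alpha (an assumed hyperparameter)."""
    reg = np.abs(pred_mask - true_mask).mean()
    z = match_logits - match_logits.max()          # numerically stable softmax
    log_probs = z - np.log(np.exp(z).sum())
    cls = -log_probs[match_label]
    return reg + alpha * cls

# Toy 2x2 "spectrogram" and mask
mix = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.array([[0.5, 0.0], [1.0, 0.25]])
sep = separate_with_mask(mix, mask)
loss = combined_loss(mask, np.ones_like(mask), np.array([2.0, 0.5]), 0)
```

The two loss terms let gradients from both the separation target and the modal-synchronization decision flow into shared features, which is the general motivation for jointly training a separator and a matching classifier.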