

DMMAN: A two-stage audio-visual fusion framework for sound separation and event localization.

Affiliations

Institute of Intelligent Manufacturing, Guangdong Academy of Sciences, Guangdong Key Laboratory of Modern Control Technology, Guangzhou, China.

School of Physics and Technology, Wuhan University, China.

Publication

Neural Netw. 2021 Jan;133:229-239. doi: 10.1016/j.neunet.2020.10.003. Epub 2020 Nov 11.

DOI: 10.1016/j.neunet.2020.10.003
PMID: 33232859
Abstract

Videos are widely used as a medium through which people perceive physical changes in the world. However, we usually receive a mixture of sounds from multiple sounding objects and cannot distinguish and localize them as separate entities in a video. To address this problem, this paper establishes a model named the Deep Multi-Modal Attention Network (DMMAN), which models unconstrained video datasets for the sound source separation and event localization tasks. Built on a multi-modal separator and a multi-modal matching classifier module, the model addresses the sound separation and modal synchronization problems through a two-stage fusion of audio and visual features. To link the two modules, regression and classification losses are combined to build the loss function of the DMMAN. The spectrum masks and attention synchronization scores estimated by the DMMAN generalize readily to the sound source and event localization tasks. Quantitative experimental results show that the DMMAN not only separates sound sources with high quality, as measured by the Signal-to-Distortion Ratio and Signal-to-Interference Ratio metrics, but also handles mixed sound scenes that were never heard together. Meanwhile, the DMMAN achieves better classification accuracy than the contrast baselines on the event localization tasks.
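This record contains only the abstract, so the DMMAN architecture itself is not reproduced here. As a rough illustration of the joint objective the abstract describes — a regression loss on the mask-separated spectrum linked with a classification loss on the audio-visual synchronization scores — the following numpy sketch computes such a combined loss. The function names, shapes, and the weighting factor `lam` are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def joint_loss(mixture_spec, est_mask, target_spec, sync_logits, event_label, lam=1.0):
    """Combined loss in the spirit of the abstract: a regression term on the
    mask-separated spectrum plus a classification term on the synchronization
    scores. `lam` (assumed) weights the classification term."""
    est_spec = est_mask * mixture_spec            # apply the estimated spectrum mask
    reg = np.mean((est_spec - target_spec) ** 2)  # regression (separation) loss
    probs = softmax(sync_logits)
    cls = -np.log(probs[event_label] + 1e-12)     # cross-entropy (event) loss
    return reg + lam * cls

rng = np.random.default_rng(0)
mix = rng.random((257, 100))    # mixture magnitude spectrogram (freq x time)
mask = rng.random((257, 100))   # estimated spectrum mask in [0, 1]
tgt = mask * mix                # perfect-mask case: regression term vanishes
logits = np.array([2.0, -1.0, 0.5])
loss = joint_loss(mix, mask, tgt, logits, event_label=0)
# with a perfect mask, the loss reduces to the cross-entropy term alone
print(loss)
```

With a perfect mask the regression term is zero, so only the synchronization classification term remains; in training, both terms would be non-zero and jointly optimized.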


Similar Articles

1. DMMAN: A two-stage audio-visual fusion framework for sound separation and event localization.
Neural Netw. 2021 Jan;133:229-239. doi: 10.1016/j.neunet.2020.10.003. Epub 2020 Nov 11.
2. The contributions of sensory dominance and attentional bias to cross-modal enhancement of visual cortex excitability.
J Cogn Neurosci. 2013 Jul;25(7):1122-35. doi: 10.1162/jocn_a_00367. Epub 2013 Feb 5.
3. Cognitive integration of asynchronous natural or non-natural auditory and visual information in videos of real-world events: an event-related potential study.
Neuroscience. 2011 Apr 28;180:181-90. doi: 10.1016/j.neuroscience.2011.01.066. Epub 2011 Feb 24.
4. Correlation between audio-visual enhancement of speech in different noise environments and SNR: a combined behavioral and electrophysiological study.
Neuroscience. 2013 Sep 5;247:145-51. doi: 10.1016/j.neuroscience.2013.05.007. Epub 2013 May 11.
5. Neural practice effect during cross-modal selective attention: Supra-modal and modality-specific effects.
Cortex. 2018 Sep;106:47-64. doi: 10.1016/j.cortex.2018.05.003. Epub 2018 May 16.
6. Effect of attention on early cortical processes associated with the sound-induced extra flash illusion.
J Cogn Neurosci. 2010 Aug;22(8):1714-29. doi: 10.1162/jocn.2009.21295.
7. The influence of matching degrees of synchronous auditory and visual information in videos of real-world events on cognitive integration: an event-related potential study.
Neuroscience. 2011 Oct 27;194:19-26. doi: 10.1016/j.neuroscience.2011.08.009. Epub 2011 Aug 10.
8. Sound-aided recovery from and persistence against visual filling-in.
Vision Res. 2004;44(16):1907-17. doi: 10.1016/j.visres.2004.03.009.
9. Existence of competing modality dominances.
Atten Percept Psychophys. 2016 May;78(4):1104-14. doi: 10.3758/s13414-016-1061-3.
10. Cross-modal orienting of visual attention.
Neuropsychologia. 2016 Mar;83:170-178. doi: 10.1016/j.neuropsychologia.2015.06.003. Epub 2015 Jun 11.

Cited By

1. ORCA-SPY enables killer whale sound source simulation, detection, classification and localization using an integrated deep learning-based segmentation.
Sci Rep. 2023 Jul 10;13(1):11106. doi: 10.1038/s41598-023-38132-7.