

Binaural SoundNet: Predicting Semantics, Depth and Motion With Binaural Sounds.

Authors

Dai Dengxin, Vasudevan Arun Balajee, Matas Jiri, Van Gool Luc

Publication

IEEE Trans Pattern Anal Mach Intell. 2023 Jan;45(1):123-136. doi: 10.1109/TPAMI.2022.3155643. Epub 2022 Dec 5.

DOI: 10.1109/TPAMI.2022.3155643
PMID: 35239475
Abstract

Humans can robustly recognize and localize objects by using visual and/or auditory cues. While machines are able to do the same with visual data already, less work has been done with sounds. This work develops an approach for scene understanding purely based on binaural sounds. The considered tasks include predicting the semantic masks of sound-making objects, the motion of sound-making objects, and the depth map of the scene. To this aim, we propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a 360 camera. The co-existence of visual and audio cues is leveraged for supervision transfer. In particular, we employ a cross-modal distillation framework that consists of multiple vision 'teacher' methods and a sound 'student' method - the student method is trained to generate the same results as the teacher methods do. This way, the auditory system can be trained without using human annotations. To further boost the performance, we propose another novel auxiliary task, coined Spatial Sound Super-Resolution, to increase the directional resolution of sounds. We then formulate the four tasks into one end-to-end trainable multi-tasking network aiming to boost the overall performance. Experimental results show that 1) our method achieves good results for all four tasks, 2) the four tasks are mutually beneficial - training them together achieves the best performance, 3) the number and orientation of microphones are both important, and 4) features learned from the standard spectrogram and features obtained by the classic signal processing pipeline are complementary for auditory perception tasks. The data and code are released on the project page: https://www.trace.ethz.ch/publications/2020/sound_perception/index.html.

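The cross-modal distillation framework the abstract describes — vision "teacher" methods generate pseudo-labels that a sound "student" learns to reproduce, so no human annotations are needed — can be sketched with a toy example. Everything below is an illustrative assumption (a linear student, random audio features, a fixed teacher output, and made-up task dimensions for semantics/depth/motion), not the paper's released code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy supervision transfer: a fixed vision "teacher" provides
# pseudo-labels; the sound "student" is trained to match them
# from binaural-audio features. Shapes and names are illustrative.
n_samples, audio_dim, n_tasks = 256, 8, 3  # e.g. semantics, depth, motion

audio_feats = rng.normal(size=(n_samples, audio_dim))
# Pseudo-labels the teacher would have produced from the visual stream.
teacher_out = audio_feats @ rng.normal(size=(audio_dim, n_tasks))

W = np.zeros((audio_dim, n_tasks))        # student parameters
task_weights = np.array([1.0, 1.0, 1.0])  # multi-task loss weights

def distill_loss(W):
    """Weighted mean-squared error between student and teacher outputs."""
    err = audio_feats @ W - teacher_out
    return float(np.mean((err ** 2) * task_weights))

lr = 0.01
initial = distill_loss(W)
for _ in range(500):
    # Gradient of the weighted MSE with respect to the student weights.
    grad = 2 * audio_feats.T @ ((audio_feats @ W - teacher_out) * task_weights) / n_samples
    W -= lr * grad
final = distill_loss(W)
```

The weighted sum over tasks mirrors the paper's end-to-end multi-tasking formulation, where the per-task losses are combined and the four tasks are trained jointly; here the weights are simply fixed to 1.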

Similar Articles

1. Binaural SoundNet: Predicting Semantics, Depth and Motion With Binaural Sounds.
   IEEE Trans Pattern Anal Mach Intell. 2023 Jan;45(1):123-136. doi: 10.1109/TPAMI.2022.3155643. Epub 2022 Dec 5.
2. Re-weighting of Sound Localization Cues by Audiovisual Training.
   Front Neurosci. 2019 Nov 15;13:1164. doi: 10.3389/fnins.2019.01164. eCollection 2019.
3. Off-Screen Sound Separation Based on Audio-visual Pre-training Using Binaural Audio.
   Sensors (Basel). 2023 May 7;23(9):4540. doi: 10.3390/s23094540.
4. Development and evaluation of the LiSN & learn auditory training software for deficit-specific remediation of binaural processing deficits in children: preliminary findings.
   J Am Acad Audiol. 2011 Nov-Dec;22(10):678-96. doi: 10.3766/jaaa.22.10.6.
5. Decoding natural scenes based on sounds of objects within scenes using multivariate pattern analysis.
   Neurosci Res. 2019 Nov;148:9-18. doi: 10.1016/j.neures.2018.11.009. Epub 2018 Dec 1.
6. Efficient coding of spectrotemporal binaural sounds leads to emergence of the auditory space representation.
   Front Comput Neurosci. 2014 Mar 7;8:26. doi: 10.3389/fncom.2014.00026. eCollection 2014.
7. Auditory and Semantic Cues Facilitate Decoding of Visual Object Category in MEG.
   Cereb Cortex. 2020 Mar 21;30(2):597-606. doi: 10.1093/cercor/bhz110.
8. Spatial shifts of audio-visual interactions by perceptual learning are specific to the trained orientation and eye.
   Seeing Perceiving. 2011;24(6):579-94. doi: 10.1163/187847611X603738.
9. Statistics of natural binaural sounds.
   PLoS One. 2014 Oct 6;9(10):e108968. doi: 10.1371/journal.pone.0108968. eCollection 2014.
10. Behavioral semantics of learning and crossmodal processing in auditory cortex: the semantic processor concept.
   Hear Res. 2011 Jan;271(1-2):3-15. doi: 10.1016/j.heares.2010.10.006. Epub 2010 Oct 29.