Enhancing Sound Source Localization via False Negative Elimination.

Author information

Song Zengjie, Zhang Jiangshe, Wang Yuxi, Fan Junsong, Zhang Zhaoxiang

Publication information

IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):10499-10514. doi: 10.1109/TPAMI.2024.3444029. Epub 2024 Nov 6.

Abstract

Sound source localization aims to localize sound-emitting objects in visual scenes. Recent methods that achieve impressive results typically rely on contrastive learning. However, the common practice of randomly sampling negatives in prior work can lead to the false negative issue, where sounds semantically similar to the visual instance are sampled as negatives and incorrectly pushed away from the visual anchor/query. As a result, this misalignment of audio and visual features could yield inferior performance. To address this issue, we propose a novel audio-visual learning framework which is instantiated with two individual learning schemes: self-supervised predictive learning (SSPL) and semantic-aware contrastive learning (SACL). SSPL explores image-audio positive pairs alone to discover semantically coherent similarities between audio and visual features, while a predictive coding module for feature alignment is introduced to facilitate the positive-only learning. In this regard, SSPL acts as a negative-free method to eliminate false negatives. By contrast, SACL is designed to compact visual features and remove false negatives, providing a reliable visual anchor and reliable audio negatives for contrast. Unlike SSPL, SACL releases the potential of audio-visual contrastive learning, offering an effective alternative to achieve the same goal. Comprehensive experiments demonstrate the superiority of our approach over the state of the art. Furthermore, we highlight the versatility of the learned representation by extending the approach to audio-visual event classification and object detection tasks.
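The false negative issue described in the abstract can be illustrated with a small sketch. This is not the authors' SSPL or SACL; it is a generic visual-anchored InfoNCE loss in which suspected false negatives are dropped from the denominator. The function names, the use of audio-audio cosine similarity as the detection signal, and the `tau` and `fn_threshold` values are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def masked_info_nce(visual, audio, tau=0.07, fn_threshold=0.9):
    """Visual-anchored InfoNCE with suspected false negatives removed.

    visual[i] and audio[i] form a positive pair. An audio sample j != i
    is treated as a suspected false negative for anchor i when it is
    nearly identical to the positive audio i (hypothetical criterion,
    not taken from the paper) and is then excluded from the contrast.
    Returns the mean loss and the list of excluded (anchor, negative) pairs.
    """
    n = len(visual)
    total = 0.0
    false_negs = []
    for i in range(n):
        pos = cosine(visual[i], audio[i]) / tau
        logits = [pos]
        for j in range(n):
            if j == i:
                continue
            if cosine(audio[i], audio[j]) > fn_threshold:
                false_negs.append((i, j))  # skip instead of pushing away
                continue
            logits.append(cosine(visual[i], audio[j]) / tau)
        # Numerically stable log-sum-exp over the positive and kept negatives.
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse - pos
    return total / n, false_negs

# Samples 0 and 1 share the same sound class; sample 2 is different.
visual = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
audio = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
plain_loss, _ = masked_info_nce(visual, audio, fn_threshold=2.0)  # no masking
masked_loss, excluded = masked_info_nce(visual, audio)
```

With random negative sampling (the `fn_threshold=2.0` call, which never masks), anchors 0 and 1 are penalized for matching each other's audio even though the features are already well aligned. Excluding those pairs removes that penalty, which is the effect the abstract attributes to eliminating false negatives.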
