
Joint Spatio-Temporal-Frequency Representation Learning for Improved Sound Event Localization and Detection

Authors

Chen Baoqing, Wang Mei, Gu Yu

Affiliations

School of Information and Communication, Guilin University of Electronic Technology, Guilin 541004, China.

College of Physics and Electronic Information Engineering, Guilin University of Technology, Guilin 541004, China.

Publication

Sensors (Basel). 2024 Sep 20;24(18):6090. doi: 10.3390/s24186090.

Abstract

Sound event localization and detection (SELD) is a crucial component of machine listening that aims to simultaneously identify and localize sound events in multichannel audio recordings. This task demands an integrated analysis of spatial, temporal, and frequency domains to accurately characterize sound events. The spatial domain pertains to the varying acoustic signals captured by multichannel microphones, which are essential for determining the location of sound sources. However, the majority of recent studies have focused on time-frequency correlations and spatio-temporal correlations separately, leading to inadequate performance in real-life scenarios. In this paper, we propose a novel SELD method that utilizes the newly developed Spatio-Temporal-Frequency Fusion Network (STFF-Net) to jointly learn comprehensive features across spatial, temporal, and frequency domains of sound events. The backbone of our STFF-Net is the Enhanced-3D (E3D) residual block, which combines 3D convolutions with a parameter-free attention mechanism to capture and refine the intricate correlations among these domains. Furthermore, our method incorporates the multi-ACCDOA format to effectively handle homogeneous overlaps between sound events. During the evaluation, we conduct extensive experiments on three de facto benchmark datasets, and our results demonstrate that the proposed SELD method significantly outperforms current state-of-the-art approaches.
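The abstract describes the E3D residual block as combining 3D convolutions with a parameter-free attention mechanism. The paper does not spell out the mechanism here, but SimAM is a well-known parameter-free attention whose energy-based weighting fits that description; the following is a minimal sketch of such a refinement step over a (channels, time, frequency) feature map. The function name, array layout, and the choice of SimAM itself are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def parameter_free_attention(x, lam=1e-4):
    """SimAM-style parameter-free attention over a (C, T, F) feature map.

    Each unit is reweighted by a sigmoid of its inverse energy: units that
    deviate more from their channel mean are considered more informative.
    No learned parameters are involved; lam is a small regularizer.
    """
    mu = x.mean(axis=(1, 2), keepdims=True)          # per-channel mean
    d = (x - mu) ** 2                                # squared deviation
    n = x.shape[1] * x.shape[2] - 1                  # units per channel minus one
    v = d.sum(axis=(1, 2), keepdims=True) / n        # per-channel variance
    e_inv = d / (4.0 * (v + lam)) + 0.5              # inverse energy per unit
    return x * (1.0 / (1.0 + np.exp(-e_inv)))        # sigmoid gating
```

In a residual block, this gating would sit after the 3D convolutions and before the skip connection is added, so the attention refines the convolutional features without adding any parameters.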


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d51f/11436190/e3f446f64d7d/sensors-24-06090-g001.jpg
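The multi-ACCDOA format mentioned in the abstract couples class activity and direction of arrival (DOA) in a single Cartesian vector per track and class: the vector's norm encodes activity, and its direction encodes the DOA, which is what lets the format represent overlapping events of the same class on separate tracks. A minimal decoding sketch follows; the threshold value, array layout, and function name are assumptions for illustration, not the paper's exact specification.

```python
import numpy as np

def decode_multi_accdoa(y, threshold=0.5):
    """Decode a multi-ACCDOA output of shape (tracks, classes, 3).

    A class is considered active on a track when its vector norm exceeds
    the threshold; the normalized vector then gives the DOA on the unit
    sphere. Returns a boolean activity mask and unit DOA vectors.
    """
    norms = np.linalg.norm(y, axis=-1)               # (tracks, classes)
    active = norms > threshold
    doa = np.zeros_like(y)
    nz = norms > 1e-8                                # avoid division by zero
    doa[nz] = y[nz] / norms[nz, None]                # unit direction vectors
    return active, doa
```

Because each track carries its own vector per class, two simultaneous events of the same class (a "homogeneous overlap") decode to two distinct DOAs instead of colliding in a single output slot.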
