Chen Guorong, Yu Yuan, Qiao Yuan, Yang Junliang, Du Chongling, Qian Zhang, Huang Xiao
School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, No. 20, Daxuecheng East Road, Shapingba District, Chongqing 401331, China.
Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR 999077, China.
Sensors (Basel). 2024 Aug 18;24(16):5336. doi: 10.3390/s24165336.
Sound Event Detection and Localization (SELD) is a joint task that addresses Sound Event Detection (SED) and Sound Source Localization (SSL) simultaneously. The difficulty of SELD is that a model must both recognize sounds and localize them in space, and sound events of different classes may overlap in time and space, which makes it harder to distinguish simultaneous events and to locate their sources. In this study, we propose the Dual-conv Coordinate Attention Module (DCAM), which combines dual convolutional blocks with Coordinate Attention, and use it to improve a network architecture based on the two-stage strategy, yielding the Two-Stage Dual-conv Coordinate Attention Model (TDCAM) for SELD. TDCAM draws on the concepts of Visual Geometry Group (VGG) networks and Coordinate Attention: it captures critical local information by attending to the coordinate information of the feature map and modeling the relationships between feature-map channels, which enhances the model's feature selection capability. To address the limited temporal modeling capacity of the single-layer Bi-directional Gated Recurrent Unit (Bi-GRU) in the two-stage network, we adopt a two-layer Bi-GRU structure and introduce frequency-mask and time-mask data augmentation to improve the model's modeling of temporal features and its generalization. In experiments on the TAU Spatial Sound Events 2019 development dataset, our approach significantly improves SELD performance over the two-stage network baseline. Ablation experiments further confirm the effectiveness of DCAM and the two-layer Bi-GRU structure.
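The frequency-mask and time-mask augmentation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give mask widths, the number of masks, or the fill value, so the function name `apply_masks`, its parameters, and their defaults are assumptions made only for illustration; this is not the authors' implementation. The input is assumed to be a spectrogram-like feature stored as a list of time frames, each a list of frequency bins.

```python
import random


def apply_masks(spec, max_freq_width=8, max_time_width=10, fill=0.0, rng=None):
    """Return a copy of `spec` with one frequency band and one time span masked.

    `spec` is a list of time frames; each frame is a list of frequency bins.
    All widths and the fill value are illustrative assumptions.
    """
    rng = rng or random.Random()
    n_frames = len(spec)
    n_bins = len(spec[0])
    out = [frame[:] for frame in spec]  # copy so the input stays unchanged

    # Frequency mask: overwrite a random band of adjacent bins in every frame.
    f_width = rng.randint(0, min(max_freq_width, n_bins))
    f_start = rng.randint(0, n_bins - f_width)
    for frame in out:
        for f in range(f_start, f_start + f_width):
            frame[f] = fill

    # Time mask: overwrite a random run of consecutive frames entirely.
    t_width = rng.randint(0, min(max_time_width, n_frames))
    t_start = rng.randint(0, n_frames - t_width)
    for t in range(t_start, t_start + t_width):
        out[t] = [fill] * n_bins

    return out


# Example: a 20-frame, 16-bin feature filled with ones.
features = [[1.0] * 16 for _ in range(20)]
augmented = apply_masks(features, rng=random.Random(0))
```

Masking of this kind is normally applied only to training inputs, with a fresh random mask drawn for each example, so that the model cannot rely on any single frequency band or time span.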