Ministry of Education Key Laboratory of Cognitive Radio and Information Processing, Guilin 541006, China.
School of Information and Communication, Guilin University of Electronic Technology, Guilin 541006, China.
Sensors (Basel). 2022 Sep 9;22(18):6818. doi: 10.3390/s22186818.
The complexity of polyphonic sounds imposes numerous challenges on their classification. In real-life settings in particular, polyphonic sound events are discontinuous and exhibit unstable time-frequency variations. Traditional single acoustic features cannot characterize the key feature information of polyphonic sound events, and this deficiency results in poor classification performance. In this paper, we propose a convolutional recurrent neural network model based on a temporal-frequency (TF) attention mechanism and a feature-space (FS) attention mechanism (TFFS-CRNN). The TFFS-CRNN model aggregates Log-Mel spectrograms and MFCC features as inputs and comprises a TF-attention module, a convolutional recurrent neural network (CRNN) module, an FS-attention module, and a bidirectional gated recurrent unit (BGRU) module. In polyphonic sound event detection (SED), the TF-attention module captures critical temporal-frequency features more effectively, while the FS-attention module assigns dynamically learnable weights to different feature dimensions. The TFFS-CRNN model thereby improves the characterization of key feature information in polyphonic SED. With the two attention modules, the model can focus on semantically relevant time frames, key frequency bands, and important feature spaces. Finally, the BGRU module learns contextual information. Experiments were conducted on the DCASE 2016 Task 3 and DCASE 2017 Task 3 datasets. The results show that the F1-score of the TFFS-CRNN model improved by 12.4% and 25.2% over the winning systems of the respective DCASE challenges, while the error rate (ER) was reduced by 0.41 and 0.37. The proposed TFFS-CRNN model thus achieves better classification performance and a lower ER in polyphonic SED.
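The abstract describes weighting a time-frequency representation along its temporal and frequency axes before further processing. As a conceptual sketch only (the paper's exact TF-attention formulation is not given in the abstract), the following NumPy snippet re-weights a toy Log-Mel spectrogram with softmax-normalized attention maps over frames and bands; the projection vectors `w_t` and `w_f` stand in for learnable parameters and are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    # numerically stable softmax along one axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tf_attention(spec, w_t, w_f):
    """Re-weight a (time x freq) spectrogram with separate temporal and
    frequency attention maps. This is a conceptual stand-in for the
    paper's TF-attention module, not its published formulation."""
    a_t = softmax(spec @ w_f, axis=0)          # (T,) one weight per frame
    a_f = softmax(w_t @ spec, axis=0)          # (F,) one weight per band
    return spec * a_t[:, None] * a_f[None, :]  # attended spectrogram

# Toy dimensions: 50 frames x 40 mel bands
T, F = 50, 40
rng = np.random.default_rng(0)
spec = rng.random((T, F))   # stand-in for a Log-Mel spectrogram
w_f = rng.random(F)         # hypothetical learnable projections
w_t = rng.random(T)

out = tf_attention(spec, w_t, w_f)
print(out.shape)            # same (T, F) shape as the input
```

In the full model, such attended features would then feed the CRNN, FS-attention, and BGRU stages; here the sketch only illustrates the axis-wise re-weighting idea.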