College of Electrical Engineering, Sichuan University, Chengdu, 610065, China.
Institute of Urban and Rural Planning and Design Zhejiang, Hangzhou, 310007, China.
Sci Rep. 2021 Jan 14;11(1):1434. doi: 10.1038/s41598-020-80713-3.
Most monaural speech separation studies use only a single type of network, and the separation quality is typically unsatisfactory, making high-quality speech separation difficult. In this study, we propose a convolutional recurrent neural network with attention (CRNN-A) framework for speech separation, fusing the advantages of the two networks. The proposed separation framework uses a convolutional neural network (CNN) as the front-end of a recurrent neural network (RNN), alleviating the problem that an RNN alone cannot effectively learn the necessary features. The framework exploits the translation invariance of the CNN to extract information without modifying the original signals. Within the front-end CNN, two differently shaped convolution kernels are designed to capture information in both the time and frequency domains of the input spectrogram. After the time-domain and frequency-domain feature maps are concatenated, the speech features are further exploited through consecutive convolutional layers. Finally, the feature map learned by the front-end CNN is combined with the original spectrogram and fed to the back-end RNN. An attention mechanism is further incorporated, focusing on the relationships among different feature maps. The effectiveness of the proposed method is evaluated on the standard MIR-1K dataset, and the results show that it outperforms the baseline RNN and other popular speech separation methods in terms of GNSDR (global normalised source-to-distortion ratio), GSIR (global source-to-interference ratio), and GSAR (global source-to-artifacts ratio). In summary, the proposed CRNN-A framework effectively combines the advantages of CNN and RNN and further optimises separation performance via the attention mechanism. The framework can shed new light on speech separation, speech enhancement, and related fields.
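The abstract describes the architecture only at a high level. The following is a minimal PyTorch sketch of one plausible reading of the CRNN-A pipeline: two kernels over time and frequency, concatenation, consecutive convolutions, channel attention over the feature maps, and a bidirectional RNN that consumes the CNN features together with the original spectrogram. All layer sizes, kernel shapes, and the exact attention form are illustrative assumptions, not the authors' reported configuration.

# A minimal sketch of the CRNN-A idea; sizes and kernel shapes are assumed.
import torch
import torch.nn as nn

class CRNNA(nn.Module):
    def __init__(self, freq_bins=512, rnn_hidden=256):
        super().__init__()
        # Two differently shaped kernels: one spanning time, one spanning
        # frequency (assumed shapes; the paper only states both domains are covered).
        self.conv_time = nn.Conv2d(1, 16, kernel_size=(1, 7), padding=(0, 3))
        self.conv_freq = nn.Conv2d(1, 16, kernel_size=(7, 1), padding=(3, 0))
        # Consecutive convolutional layers over the concatenated feature maps.
        self.conv_stack = nn.Sequential(
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Channel-attention weights, one plausible reading of "focusing on the
        # relationship among different feature maps".
        self.att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 32), nn.Sigmoid(),
        )
        # Back-end RNN consumes CNN features concatenated with the spectrogram.
        self.rnn = nn.GRU(2 * freq_bins, rnn_hidden, batch_first=True,
                          bidirectional=True)
        self.mask = nn.Linear(2 * rnn_hidden, freq_bins)

    def forward(self, spec):                       # spec: (batch, freq, time)
        x = spec.unsqueeze(1)                      # (B, 1, F, T)
        feats = torch.cat([self.conv_time(x), self.conv_freq(x)], dim=1)
        w = self.att(feats).unsqueeze(-1).unsqueeze(-1)        # (B, 32, 1, 1)
        feats = self.conv_stack(feats * w)         # (B, 1, F, T)
        # Combine learned features with the original spectrogram.
        combined = torch.cat([feats.squeeze(1), spec], dim=1)  # (B, 2F, T)
        out, _ = self.rnn(combined.transpose(1, 2))            # (B, T, 2H)
        return torch.sigmoid(self.mask(out)).transpose(1, 2)   # soft mask (B, F, T)

# Usage on a dummy magnitude spectrogram; the mask is applied multiplicatively.
mix = torch.randn(2, 512, 100).abs()
voice = CRNNA()(mix) * mix

A soft time-frequency mask applied to the mixture magnitude is a common output head for this family of separators; the paper may instead regress the source spectrograms directly.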
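For reference, separation results on MIR-1K are conventionally reported with BSS-Eval-based measures. Assuming the paper follows that convention, the global scores are length-weighted means over the test clips; the weighting shown below is an assumption based on standard practice, not a formula quoted from the paper.

\[
\mathrm{NSDR}(\hat{v}, v, x) = \mathrm{SDR}(\hat{v}, v) - \mathrm{SDR}(x, v),
\qquad
\mathrm{GNSDR} = \frac{\sum_k w_k \,\mathrm{NSDR}(\hat{v}_k, v_k, x_k)}{\sum_k w_k},
\]

where \(\hat{v}\) is the separated voice, \(v\) the clean voice, \(x\) the mixture, and \(w_k\) the length of clip \(k\); GSIR and GSAR are the analogous length-weighted means of the BSS-Eval SIR and SAR.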