College of Electrical Engineering, Sichuan University, Chengdu, 610065, China.
Institute of Urban and Rural Planning and Design Zhejiang, Hangzhou, 310007, China.
Sci Rep. 2021 Jan 14;11(1):1434. doi: 10.1038/s41598-020-80713-3.
Most monaural speech separation studies use only a single type of network, and the separation quality is typically unsatisfactory, making high-quality speech separation difficult. In this study, we propose a convolutional recurrent neural network with attention (CRNN-A) framework for speech separation, fusing the advantages of the two networks. The proposed separation framework uses a convolutional neural network (CNN) as the front-end of a recurrent neural network (RNN), alleviating the problem that an RNN alone cannot effectively learn the necessary features. The framework exploits the translation invariance of the CNN to extract information without modifying the original signals. Within the front-end CNN, two differently shaped convolution kernels are designed to capture information in both the time and frequency domains of the input spectrogram. After the time-domain and frequency-domain feature maps are concatenated, the speech features are further exploited through consecutive convolutional layers. Finally, the feature map learned by the front-end CNN is combined with the original spectrogram and fed to the back-end RNN. An attention mechanism is further incorporated, focusing on the relationships among different feature maps. The effectiveness of the proposed method is evaluated on the standard MIR-1K dataset, and the results show that it outperforms the baseline RNN and other popular speech separation methods in terms of GNSDR (global normalised source-to-distortion ratio), GSIR (global source-to-interference ratio), and GSAR (global source-to-artifacts ratio). In summary, the proposed CRNN-A framework effectively combines the advantages of CNN and RNN and further optimises separation performance via the attention mechanism. The framework can shed new light on speech separation, speech enhancement, and related fields.
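The abstract describes the architecture only at a high level. The following is a minimal PyTorch sketch of one plausible reading of the CRNN-A pipeline: two kernels over time and frequency, concatenation, consecutive convolutions, channel attention over the feature maps, and a bidirectional RNN that consumes the CNN features together with the original spectrogram. All layer sizes, kernel shapes, and the exact attention form are illustrative assumptions, not the authors' reported configuration.

# A minimal sketch of the CRNN-A idea; sizes and kernel shapes are assumed.
import torch
import torch.nn as nn

class CRNNA(nn.Module):
    def __init__(self, freq_bins=512, rnn_hidden=256):
        super().__init__()
        # Two differently shaped kernels: one spanning time, one spanning
        # frequency (assumed shapes; the paper only states both domains are covered).
        self.conv_time = nn.Conv2d(1, 16, kernel_size=(1, 7), padding=(0, 3))
        self.conv_freq = nn.Conv2d(1, 16, kernel_size=(7, 1), padding=(3, 0))
        # Consecutive convolutional layers over the concatenated feature maps.
        self.conv_stack = nn.Sequential(
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Channel-attention weights, one plausible reading of "focusing on the
        # relationship among different feature maps".
        self.att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 32), nn.Sigmoid(),
        )
        # Back-end RNN consumes CNN features concatenated with the spectrogram.
        self.rnn = nn.GRU(2 * freq_bins, rnn_hidden, batch_first=True,
                          bidirectional=True)
        self.mask = nn.Linear(2 * rnn_hidden, freq_bins)

    def forward(self, spec):                       # spec: (batch, freq, time)
        x = spec.unsqueeze(1)                      # (B, 1, F, T)
        feats = torch.cat([self.conv_time(x), self.conv_freq(x)], dim=1)
        w = self.att(feats).unsqueeze(-1).unsqueeze(-1)        # (B, 32, 1, 1)
        feats = self.conv_stack(feats * w)         # (B, 1, F, T)
        # Combine learned features with the original spectrogram.
        combined = torch.cat([feats.squeeze(1), spec], dim=1)  # (B, 2F, T)
        out, _ = self.rnn(combined.transpose(1, 2))            # (B, T, 2H)
        return torch.sigmoid(self.mask(out)).transpose(1, 2)   # soft mask (B, F, T)

# Usage on a dummy magnitude spectrogram; the mask is applied multiplicatively.
mix = torch.randn(2, 512, 100).abs()
voice = CRNNA()(mix) * mix

A soft time-frequency mask applied to the mixture magnitude is a common output head for this family of separators; the paper may instead regress the source spectrograms directly.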
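For reference, separation results on MIR-1K are conventionally reported with BSS-Eval-based measures. Assuming the paper follows that convention, the global scores are length-weighted means over the test clips; the weighting shown below is an assumption based on standard practice, not a formula quoted from the paper.

\[
\mathrm{NSDR}(\hat{v}, v, x) = \mathrm{SDR}(\hat{v}, v) - \mathrm{SDR}(x, v),
\qquad
\mathrm{GNSDR} = \frac{\sum_k w_k \,\mathrm{NSDR}(\hat{v}_k, v_k, x_k)}{\sum_k w_k},
\]

where \(\hat{v}\) is the separated voice, \(v\) the clean voice, \(x\) the mixture, and \(w_k\) the length of clip \(k\); GSIR and GSAR are the analogous length-weighted means of the BSS-Eval SIR and SAR.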