Triple-0: Zero-shot denoising and dereverberation on an end-to-end frozen anechoic speech separation network.

Affiliations

Department of Electrical Engineering, University of Engineering and Technology, Peshawar, Pakistan.

Intelligent Information Processing Lab, National Center of Artificial Intelligence, University of Engineering and Technology, Peshawar, Pakistan.

Publication information

PLoS One. 2024 Jul 16;19(7):e0301692. doi: 10.1371/journal.pone.0301692. eCollection 2024.

Abstract

Speech enhancement is crucial for both human and machine listening applications. Over the last decade, the use of deep learning for speech enhancement has yielded tremendous improvements over classical signal processing and machine learning methods. However, training a deep neural network is not only time-consuming; it also requires extensive computational resources and a large training dataset. Transfer learning, i.e. using a pretrained network for a new task, helps by reducing the training time, computational resources, and dataset size required, but the network still needs to be fine-tuned for the new task. This paper presents a novel method of speech denoising and dereverberation (SD&D) on an end-to-end frozen binaural anechoic speech separation network. The frozen network requires neither architectural changes nor fine-tuning for the new task, both of which transfer learning usually requires. The interaural cues of a source placed in noisy, echoic surroundings are given as input to this pretrained network to extract the target speech from noise and reverberation. Although the pretrained model used in this paper never saw noisy reverberant conditions during training, it performs satisfactorily in zero-shot testing (ZST) under these conditions. This is because the pretrained model was trained on the direct-path interaural cues of an active source, so it can recognize them even in the presence of echoes and noise. ZST on the same dataset on which the pretrained network was trained (homo-corpus), for an unseen class of interference, shows considerable improvement over the weighted prediction error (WPE) algorithm in terms of four objective speech quality and intelligibility metrics. The proposed model also performs similarly to a deep learning SD&D algorithm on this dataset under varying conditions of noise and reverberation. Similarly, ZST on a different dataset improves intelligibility and delivers quality almost equivalent to that of the WPE algorithm.
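
The abstract does not say which interaural cues are fed to the pretrained network. As an illustration only, the sketch below computes two cues commonly used in binaural processing, the interaural level difference (ILD) and the interaural phase difference (IPD), at a single frequency bin of one frame. The function names, the toy signal, and the choice of ILD and IPD are assumptions made here. They are not taken from the paper.

```python
import cmath
import math


def dft_bin(frame, k):
    """Return the k-th DFT coefficient of a real-valued frame (naive sum)."""
    n = len(frame)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))


def interaural_cues(left, right, k):
    """Hypothetical helper: ILD in dB and IPD in radians at DFT bin k."""
    l_coef = dft_bin(left, k)
    r_coef = dft_bin(right, k)
    eps = 1e-12  # guards against log(0) and division by zero on silent frames
    ild_db = 20.0 * math.log10((abs(l_coef) + eps) / (abs(r_coef) + eps))
    # Phase of L * conj(R): positive when the left channel leads the right.
    ipd = cmath.phase(l_coef * r_coef.conjugate())
    return ild_db, ipd


# Toy binaural frame (assumed data): the same tone reaches the right ear
# at half the amplitude and two samples later than the left ear.
N, K = 64, 4
left = [math.cos(2 * math.pi * K * i / N) for i in range(N)]
right = [0.5 * math.cos(2 * math.pi * K * (i - 2) / N) for i in range(N)]

ild_db, ipd = interaural_cues(left, right, K)
print(f"ILD = {ild_db:.2f} dB, IPD = {ipd:.3f} rad")
```

For this toy frame the ILD is about 6.02 dB (an amplitude ratio of 2) and the IPD is about 0.785 rad (π/4, the phase shift of a two-sample delay at bin 4 of 64). In a real recording, noise and reflections perturb both values, which is the condition the abstract's zero-shot tests probe.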

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c035/11251582/491ceb855f68/pone.0301692.g001.jpg
