Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation

Author information

Wang Zhong-Qiu, Wang Peidong, Wang DeLiang

Affiliations

Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210-1277 USA, while performing this work. He is now with Mitsubishi Electric Research Laboratories, Cambridge, MA 02139, USA.

Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210-1277 USA.

Publication information

IEEE/ACM Trans Audio Speech Lang Process. 2021;29:2001-2014. doi: 10.1109/taslp.2021.3083405. Epub 2021 May 26.

DOI:10.1109/taslp.2021.3083405

PMID:34212067

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8240467/

Abstract

We propose multi-microphone complex spectral mapping, a simple way of applying deep learning for time-varying non-linear beamforming, for speaker separation in reverberant conditions. We aim at both speaker separation and dereverberation. Our study first investigates offline utterance-wise speaker separation and then extends to block-online continuous speech separation (CSS). Assuming a fixed array geometry between training and testing, we train deep neural networks (DNN) to predict the real and imaginary (RI) components of target speech at a reference microphone from the RI components of multiple microphones. We then integrate multi-microphone complex spectral mapping with minimum variance distortionless response (MVDR) beamforming and post-filtering to further improve separation, and combine it with frame-level speaker counting for block-online CSS. Although our system is trained on simulated room impulse responses (RIR) based on a fixed number of microphones arranged in a given geometry, it generalizes well to a real array with the same geometry. State-of-the-art separation performance is obtained on the simulated two-talker SMS-WSJ corpus and the real-recorded LibriCSS dataset.
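The two signal-processing steps named in the abstract can be sketched in NumPy: stacking the RI components of all microphones as network input, and deriving an MVDR beamformer from estimated target spectra. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, array shapes, and diagonal-loading constant are hypothetical. The MVDR formula used here is the common covariance-based form w = (Φn⁻¹ Φs) u / trace(Φn⁻¹ Φs), which may differ from the formulation in the paper, and the DNN itself is not shown (its output is assumed to be an estimate of the target speech at every microphone).

```python
import numpy as np


def stack_ri(stft_mc):
    """Stack real and imaginary parts of a multi-microphone STFT.

    stft_mc: complex array of shape (mics, frames, freqs).
    Returns a real array of shape (2 * mics, frames, freqs), one possible
    input layout for a network that maps RI components to RI components.
    """
    return np.concatenate([stft_mc.real, stft_mc.imag], axis=0)


def mvdr_weights(target_est, mixture, ref_mic=0, eps=1e-6):
    """Time-invariant MVDR weights per frequency (assumed formulation).

    target_est: estimated target speech at each microphone,
                complex array of shape (mics, frames, freqs).
    mixture:    observed mixture, same shape.
    Returns complex weights of shape (freqs, mics).
    """
    n_mics, n_frames, _ = mixture.shape
    # Everything that is not the estimated target is treated as interference.
    noise_est = mixture - target_est

    # Spatial covariance matrices, shape (freqs, mics, mics).
    phi_s = np.einsum("mtf,ntf->fmn", target_est, target_est.conj()) / n_frames
    phi_n = np.einsum("mtf,ntf->fmn", noise_est, noise_est.conj()) / n_frames
    phi_n = phi_n + eps * np.eye(n_mics)  # diagonal loading for stability

    # Solve phi_n @ X = phi_s for each frequency instead of inverting phi_n.
    ratio = np.linalg.solve(phi_n, phi_s)
    trace = np.trace(ratio, axis1=1, axis2=2)

    # Select the column for the reference microphone and normalize.
    return ratio[:, :, ref_mic] / (trace[:, None] + eps)


def apply_beamformer(weights, mixture):
    """Apply y[t, f] = w[f]^H x[t, f]; returns shape (frames, freqs)."""
    return np.einsum("fm,mtf->tf", weights.conj(), mixture)
```

In a pipeline like the one the abstract outlines, the beamformer output could then be passed, together with the mixture, to a second network acting as a post-filter. That stage is omitted here because the abstract does not specify it.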

