Masood Delfarah and DeLiang Wang
Computer Science and Engineering, The Ohio State University, Columbus, OH, USA.
IEEE/ACM Trans Audio Speech Lang Process. 2019 Nov;27(11):1839-1848. doi: 10.1109/taslp.2019.2934319. Epub 2019 Aug 12.
Speaker separation refers to the problem of separating speech signals from a mixture of simultaneous speakers. Previous studies have largely been limited to the speaker separation problem in anechoic conditions. This paper addresses talker-dependent speaker separation in reverberant conditions, which are characteristic of real-world environments. We employ recurrent neural networks with bidirectional long short-term memory (BLSTM) to separate and dereverberate the target speech signal. We propose two-stage networks to effectively deal with both speaker separation and speech dereverberation. In the two-stage model, the first stage separates and dereverberates two-talker mixtures, and the second stage further enhances the separated target signal. We have extensively evaluated the two-stage architecture, and our empirical results demonstrate large improvements over unprocessed mixtures and clear performance gains over single-stage networks across a wide range of target-to-interferer ratios and reverberation times, in simulated as well as recorded rooms. Moreover, we show that time-frequency masking yields better performance than spectral mapping for reverberant speaker separation.
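The abstract's closing contrast can be illustrated conceptually: in time-frequency (T-F) masking, a network predicts a per-unit gain applied to the mixture spectrogram, whereas in spectral mapping it regresses the target magnitude directly. The sketch below is not the paper's BLSTM model; it only shows the ideal ratio mask (IRM), a common masking training target, computed on made-up toy magnitude spectra (all values and shapes are illustrative assumptions).

```python
# Toy illustration of T-F masking with an ideal ratio mask (IRM):
#   IRM(t, f) = |S(t, f)|^2 / (|S(t, f)|^2 + |N(t, f)|^2),
# where S is the target and N the interferer. The mask lies in [0, 1]
# and is applied element-wise to the mixture magnitude. A network
# trained for masking predicts this mask; spectral mapping would
# instead regress the target magnitudes themselves.

def ideal_ratio_mask(target_mag, interf_mag):
    """Per-T-F-unit IRM from known target and interferer magnitudes."""
    return [
        [s * s / (s * s + n * n) if (s or n) else 0.0
         for s, n in zip(t_row, n_row)]
        for t_row, n_row in zip(target_mag, interf_mag)
    ]

def apply_mask(mixture_mag, mask):
    """Masking step: element-wise product of mask and mixture magnitude."""
    return [
        [m * w for m, w in zip(mix_row, mask_row)]
        for mix_row, mask_row in zip(mixture_mag, mask)
    ]

# Made-up 2-frame x 3-bin magnitude spectrograms (frames x frequency bins).
target = [[3.0, 0.0, 1.0], [4.0, 2.0, 0.0]]
interf = [[4.0, 2.0, 0.0], [3.0, 0.0, 1.0]]
mixture = [[s + n for s, n in zip(tr, nr)] for tr, nr in zip(target, interf)]

mask = ideal_ratio_mask(target, interf)
estimate = apply_mask(mixture, mask)
```

Because the mask is bounded in [0, 1], it is often easier to learn than unbounded target magnitudes, which is one common explanation for the advantage of masking the abstract reports.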