Jitong Chen, DeLiang Wang
Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA.
J Acoust Soc Am. 2017 Jun;141(6):4705. doi: 10.1121/1.4986931.
Speech separation can be formulated as learning to estimate a time-frequency mask from acoustic features extracted from noisy speech. For supervised speech separation, generalization to unseen noises and unseen speakers is a critical issue. Although deep neural networks (DNNs) have been successful in noise-independent speech separation, they are limited in modeling a large number of speakers. To improve speaker generalization, a separation model based on long short-term memory (LSTM) is proposed, which naturally accounts for the temporal dynamics of speech. Systematic evaluation shows that the proposed model substantially outperforms a DNN-based model on unseen speakers and unseen noises in terms of objective speech intelligibility. Analysis of LSTM internal representations reveals that LSTM captures long-term speech contexts. The LSTM model is also found to be more advantageous for low-latency speech separation: without future frames, it performs better than the DNN model with future frames. The proposed model represents an effective approach for speaker- and noise-independent speech separation.
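The abstract does not specify the network configuration, so the following is only a minimal sketch of the general formulation it describes: a unidirectional (causal) LSTM that maps acoustic features of noisy speech to a bounded time-frequency mask. It assumes PyTorch, and all layer sizes and dimensions (n_features, n_freq_bins, hidden_size, num_layers) are hypothetical placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn

class LSTMMaskEstimator(nn.Module):
    """Sketch of a causal LSTM mapping noisy-speech features to a T-F mask."""

    def __init__(self, n_features=64, n_freq_bins=64, hidden_size=512, num_layers=4):
        super().__init__()
        # A unidirectional LSTM uses only past frames, which is why such a
        # model suits low-latency separation (no future frames required).
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers, batch_first=True)
        # Sigmoid bounds each mask value in [0, 1], as for a ratio-style mask.
        self.mask = nn.Sequential(nn.Linear(hidden_size, n_freq_bins), nn.Sigmoid())

    def forward(self, features):
        # features: (batch, time, n_features) extracted from noisy speech
        h, _ = self.lstm(features)
        return self.mask(h)  # (batch, time, n_freq_bins) estimated mask

# Usage: estimate masks for a batch of 100-frame utterances; the mask is then
# applied to the noisy time-frequency representation to recover speech.
model = LSTMMaskEstimator()
noisy_feats = torch.randn(8, 100, 64)
estimated_mask = model(noisy_feats)
```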