Xu Shao, Ben Milner
School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, United Kingdom.
J Acoust Soc Am. 2005 Aug;118(2):1134-43. doi: 10.1121/1.1953269.
This work proposes a method to reconstruct an acoustic speech signal solely from a stream of mel-frequency cepstral coefficients (MFCCs), as may be encountered in a distributed speech recognition (DSR) system. Previous methods for speech reconstruction have required, in addition to the MFCC vectors, fundamental frequency and voicing components. In this work the voicing classification and fundamental frequency are predicted from the MFCC vectors themselves using two maximum a posteriori (MAP) methods. The first method enables fundamental frequency prediction by modeling the joint density of MFCCs and fundamental frequency using a single Gaussian mixture model (GMM). The second scheme uses a set of hidden Markov models (HMMs) to link together a set of state-dependent GMMs, which enables a more localized modeling of the joint density of MFCCs and fundamental frequency. Experimental results on speaker-independent male and female speech show that accurate voicing classification and fundamental frequency prediction are attained when compared to hand-corrected reference fundamental frequency measurements. The use of the predicted fundamental frequency and voicing for speech reconstruction is shown to give very similar speech quality to that obtained using the reference fundamental frequency and voicing.
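The single-GMM scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the GMM parameters here are set by hand rather than trained by EM on joint [MFCC, f0] vectors, the MFCC dimensionality is reduced for brevity, and the estimate shown is the standard conditional-expectation form of GMM-based regression over the joint density.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4  # illustrative MFCC dimensionality (typical DSR front ends use 13)
K = 2  # number of mixture components (hypothetical; the paper's count differs)

# Joint vector z = [x (MFCC, D dims); y (fundamental-frequency feature, 1 dim)].
# Parameters are random stand-ins for what EM training would produce.
weights = np.array([0.6, 0.4])
means = rng.normal(size=(K, D + 1))
covs = np.empty((K, D + 1, D + 1))
for k in range(K):
    A = rng.normal(size=(D + 1, D + 1))
    covs[k] = A @ A.T + (D + 1) * np.eye(D + 1)  # symmetric positive definite

def predict_f0(x):
    """Estimate y given x under the joint GMM:

    E[y|x] = sum_k h_k(x) * (mu_y_k + Sig_yx_k Sig_xx_k^{-1} (x - mu_x_k)),

    where h_k(x) is the posterior probability of component k given x.
    """
    log_resp = np.empty(K)
    cond_mean = np.empty(K)
    for k in range(K):
        mu_x, mu_y = means[k, :D], means[k, D]
        Sxx = covs[k, :D, :D]   # marginal covariance of x
        Syx = covs[k, D, :D]    # cross-covariance between y and x
        diff = x - mu_x
        sol = np.linalg.solve(Sxx, diff)
        # log[w_k * N(x; mu_x, Sxx)] up to constants shared across components
        _, logdet = np.linalg.slogdet(Sxx)
        log_resp[k] = np.log(weights[k]) - 0.5 * (diff @ sol + logdet)
        cond_mean[k] = mu_y + Syx @ sol  # per-component conditional mean of y
    h = np.exp(log_resp - log_resp.max())
    h /= h.sum()                        # component posteriors h_k(x)
    return float(h @ cond_mean)

print(predict_f0(rng.normal(size=D)))
```

The HMM-based second scheme replaces the single joint density with state-dependent GMMs of this form, selecting or weighting them via the HMM state alignment, so each state models the MFCC-f0 relationship over a more localized region of the acoustic space.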