Darch Jonathan, Milner Ben, Vaseghi Saeed
School of Computing Sciences, University of East Anglia, Norwich, United Kingdom.
J Acoust Soc Am. 2008 Dec;124(6):3989-4000. doi: 10.1121/1.2997436.
The aim of this work is to develop methods that enable acoustic speech features to be predicted from mel-frequency cepstral coefficient (MFCC) vectors as may be encountered in distributed speech recognition architectures. The work begins with a detailed analysis of the multiple correlation between acoustic speech features and MFCC vectors. This confirms the existence of correlation, which is found to be higher when measured within specific phonemes rather than globally across all speech sounds. The correlation analysis leads to the development of a statistical method of predicting acoustic speech features from MFCC vectors that utilizes a network of hidden Markov models (HMMs) to localize prediction to specific phonemes. Within each HMM, the joint density of acoustic features and MFCC vectors is modeled and used to make a maximum a posteriori prediction. Experimental results are presented across a range of conditions, such as with speaker-dependent, gender-dependent, and gender-independent constraints, and these show that acoustic speech features can be predicted from MFCC vectors with good accuracy. A comparison is also made against an alternative scheme that substitutes the higher-order MFCCs with acoustic features for transmission. This delivers accurate acoustic features but at the expense of a significant reduction in speech recognition accuracy.
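The maximum a posteriori prediction step can be illustrated with a minimal sketch. It assumes, purely for illustration, that the joint density of an acoustic feature vector x and an MFCC vector y within one HMM state is a single full-covariance Gaussian; the abstract does not specify the form of the density actually used. Under that assumption the MAP estimate of x given y is the conditional mean, mu_x + S_xy S_yy^{-1} (y - mu_y). The function names `fit_joint_gaussian` and `map_predict` are hypothetical and do not come from the paper.

```python
import numpy as np


def fit_joint_gaussian(x, y):
    """Estimate the mean and covariance of z = [x, y] from paired frames.

    x: (n, dx) acoustic features, y: (n, dy) MFCC vectors.
    Assumption: one Gaussian stands in for the per-state joint density.
    """
    z = np.hstack([x, y])
    mu = z.mean(axis=0)
    sigma = np.cov(z, rowvar=False)
    return mu, sigma


def map_predict(y, mu, sigma, dx):
    """MAP (conditional-mean) prediction of x from y under a joint Gaussian."""
    y = np.atleast_2d(y)
    mu_x, mu_y = mu[:dx], mu[dx:]
    s_xy = sigma[:dx, dx:]  # cross-covariance between x and y
    s_yy = sigma[dx:, dx:]  # covariance of y
    # Solve S_yy w = (y - mu_y)^T rather than forming an explicit inverse.
    w = np.linalg.solve(s_yy, (y - mu_y).T).T
    return mu_x + w @ s_xy.T


if __name__ == "__main__":
    # Synthetic data only: x is a noisy linear function of y.
    rng = np.random.default_rng(0)
    y = rng.normal(size=(2000, 4))
    x = y @ rng.normal(size=(4, 2)) + 0.01 * rng.normal(size=(2000, 2))
    mu, sigma = fit_joint_gaussian(x, y)
    x_hat = map_predict(y, mu, sigma, dx=2)
    print("mean squared error:", np.mean((x_hat - x) ** 2))
```

In a phoneme-localized scheme such as the one described, one such model would be held per HMM state and the state chosen by decoding would select which mean and covariance to apply to each frame.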