Medizinische Physik, Universität Oldenburg, D-26111 Oldenburg, Germany.
J Acoust Soc Am. 2009 Nov;126(5):2635-48. doi: 10.1121/1.3224721.
This study compares the phoneme recognition performance of a microscopic speech recognition model with that of normal-hearing listeners in speech-shaped noise. "Microscopic" has a twofold meaning for this model. First, the speech recognition rate is predicted on a phoneme-by-phoneme basis. Second, the signal waveforms to be recognized are processed by mimicking elementary parts of human auditory processing. The model is based on an approach by Holube and Kollmeier [J. Acoust. Soc. Am. 100, 1703-1716 (1996)] and consists of a psychoacoustically and physiologically motivated preprocessing stage and a simple dynamic-time-warp speech recognizer. The model is evaluated on nonsense speech presented in a closed-set paradigm. Averaged phoneme recognition rates, phoneme-specific recognition rates, and phoneme confusions are analyzed. The influence of different perceptual distance measures and of the model's a priori knowledge is investigated. The results show that human performance can be predicted by this model when it uses an optimal detector, i.e., identical speech waveforms for both training and testing the recognizer. The best model performance is obtained with distance measures that focus mainly on small perceptual distances and neglect outliers.
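The dynamic-time-warp recognizer mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy one-dimensional features, and in particular the `saturating` distance are assumptions chosen only to show how a template-matching recognizer might weight small frame distances more heavily than outliers. The auditory preprocessing stage is not modeled here.

```python
import math


def euclidean(u, v):
    """Plain Euclidean distance between two feature frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def saturating(u, v, scale=1.0):
    """Hypothetical bounded frame distance (an assumption, not from the paper).

    It grows roughly like the Euclidean distance when frames are close and
    saturates towards 1 when they are far apart, so outlier frames
    contribute little extra cost.
    """
    return 1.0 - math.exp(-euclidean(u, v) / scale)


def dtw_distance(a, b, dist=euclidean):
    """Accumulated cost of the best monotonic alignment of sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the minimal cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def recognize(test, templates, dist=euclidean):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(test, templates[label], dist))


# Toy closed-set example with one-dimensional "feature frames".
templates = {
    "a": [[0.0], [1.0], [2.0]],
    "b": [[5.0], [5.0], [5.0]],
}
test = [[0.0], [0.0], [1.0], [2.0]]  # time-stretched version of "a"
print(recognize(test, templates))
print(recognize(test, templates, dist=saturating))
```

In this toy setup the closed set consists of two templates, and the test item is a time-stretched copy of one of them, so the warping path absorbs the difference in duration. Swapping `dist` is the only change needed to compare frame distance measures.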