Speech Technology and Research Laboratory, SRI International, Menlo Park, California 94025, USA.
J Acoust Soc Am. 2012 Mar;131(3):2270-87. doi: 10.1121/1.3682038.
Studies have shown that supplementary articulatory information can help to improve the recognition rate of automatic speech recognition systems. Unfortunately, articulatory information is not directly observable, so it must be estimated from the speech signal. This study describes a system that recognizes articulatory gestures from speech and uses the recognized gestures in a speech recognition system. Recognizing gestures for a given utterance involves recovering the set of underlying gestural activations and their associated dynamic parameters. This paper proposes a neural network architecture for recognizing articulatory gestures from speech and presents ways to incorporate articulatory gestures into a digit recognition task. The lack of a natural speech database containing gestural information prompted us to use three stages of evaluation. First, the proposed gestural annotation architecture was tested on a synthetic speech dataset, which showed that the use of estimated tract-variable time functions improved gesture recognition performance. In the second stage, gesture-recognition models were applied to natural speech waveforms, and word recognition experiments revealed that the recognized gestures can improve the noise robustness of a word recognition system. In the final stage, a gesture-based Dynamic Bayesian Network was trained, and the results indicate that incorporating gestural information can improve word recognition performance compared to acoustic-only systems.