
Intonation and dialog context as constraints for speech recognition.

Author Information

Taylor P, King S, Isard S, Wright H

Affiliation

Center for Speech Technology Research, University of Edinburgh, U.K.

Publication Information

Lang Speech. 1998 Jul-Dec;41(Pt 3-4):493-512. doi: 10.1177/002383099804100411.

Abstract

This paper describes a way of using intonation and dialog context to improve the performance of an automatic speech recognition (ASR) system. Our experiments were run on the DCIEM Maptask corpus, a corpus of spontaneous task-oriented dialog speech. This corpus has been tagged according to a dialog analysis scheme that assigns each utterance to one of 12 "move types," such as "acknowledge," "query-yes/no," or "instruct." Most ASR systems use a bigram language model to constrain the possible sequences of words that might be recognized. Here we use a separate bigram language model for each move type. We show that when the "correct" move-specific language model is used for each utterance in the test set, the word error rate of the recognizer drops. Of course, when the recognizer is run on previously unseen data, it cannot know in advance what move type the speaker has just produced. To determine the move type, we use an intonation model combined with a dialog model that puts constraints on possible sequences of move types, as well as the speech recognizer likelihoods for the different move-specific models. In the full recognition system, the combination of automatic move-type recognition with the move-specific language models reduces the overall word error rate by a small but significant amount when compared with a baseline system that does not take intonation or dialog acts into account. Interestingly, the word error improvement is restricted to "initiating" move types, where word recognition is important. In "response" move types, where the important information is conveyed by the move type itself (for example, a positive versus a negative response), there is no word error improvement, but recognition of the response types themselves is good. The paper discusses the intonation model, the language models, and the dialog model in detail and describes the architecture in which they are combined.
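The move-type selection the abstract describes can be sketched as a simple log-probability combination: a dialog bigram over move-type sequences supplies the prior, and the intonation model and the move-specific recognizers each supply a likelihood for the candidate move type. The following is a minimal illustrative sketch, not the authors' implementation; the move-type names come from the abstract, but all scores and function names here are invented for illustration.

```python
import math

# Three of the 12 Maptask move types named in the abstract.
MOVE_TYPES = ["acknowledge", "query-yes/no", "instruct"]

def best_move(prev_move, dialog_bigram, intonation_loglik, asr_loglik):
    """Pick the move type m maximizing
    log P(m | prev_move) + log P(intonation | m) + log P(words | m, LM_m),
    i.e., combine the dialog-model constraint, the intonation-model score,
    and the likelihood of the move-specific recognizer."""
    def score(m):
        return (dialog_bigram[(prev_move, m)]
                + intonation_loglik[m]
                + asr_loglik[m])
    return max(MOVE_TYPES, key=score)

# Toy numbers: after an "instruct" move, an "acknowledge" response is a
# plausible continuation, and here the intonation and ASR evidence agree.
dialog_bigram = {("instruct", "acknowledge"): math.log(0.6),
                 ("instruct", "query-yes/no"): math.log(0.3),
                 ("instruct", "instruct"):     math.log(0.1)}
intonation_loglik = {"acknowledge": -1.0, "query-yes/no": -2.5, "instruct": -3.0}
asr_loglik = {"acknowledge": -10.0, "query-yes/no": -12.0, "instruct": -11.5}

chosen = best_move("instruct", dialog_bigram, intonation_loglik, asr_loglik)
print(chosen)  # "acknowledge"
# The chosen move type then selects which move-specific bigram word model
# constrains the final word recognition pass.
```

The design point this illustrates is that no single knowledge source decides the move type: the dialog model, intonation model, and recognizer likelihoods are combined additively in the log domain before the move-specific language model is applied.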

