Leveraging Linguistic Context in Dyadic Interactions to Improve Automatic Speech Recognition for Children.

Author information

Kumar Manoj, Kim So Hyun, Lord Catherine, Lyon Thomas D, Narayanan Shrikanth

Affiliations

Signal Analysis and Interpretation Lab, University of Southern California.

Center for Autism and the Developing Brain, Weill Cornell Medicine.

Publication information

Comput Speech Lang. 2020 Sep;63. doi: 10.1016/j.csl.2020.101101. Epub 2020 Apr 16.

DOI: 10.1016/j.csl.2020.101101

PMID: 32431473

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7236760/

Abstract

Automatic speech recognition for child speech has long been considered a more challenging problem than for adult speech. Various contributing factors have been identified, such as larger acoustic speech variability, including mispronunciations, due to continuing biological changes in growth; developing vocabulary and linguistic skills; and scarcity of training corpora. A further challenge arises when dealing with spontaneous speech of children involved in a conversational interaction, especially when the child may have limited or impaired communication ability. This includes health applications, one of the motivating domains of this paper, that involve goal-oriented dyadic interactions between a child and a clinician or adult social partner as part of behavioral assessment. In this work, we use linguistic context information from the interaction to adapt speech recognition models for children's speech. Specifically, the spoken language of the interacting adult provides the context for the child's speech. We propose two methods to exploit this context: lexical repetitions and semantic response generation. For the latter, we use sequence-to-sequence models that learn to predict the target child utterance given the context adult utterances. Long-term context is incorporated in the model by propagating the cell state across the duration of the conversation. We use interpolation techniques to adapt language models at the utterance level, and analyze the effect of the length and direction (forward and backward) of the context. Two different domains are used in our experiments to demonstrate the generality of our methods: interactions between a child with ASD and an adult social partner in a play-based, naturalistic setting, and forensic interviews between a child and a trained interviewer. In both cases, context-adapted models yield significant improvement (up to 10.71% in absolute word error rate) over the baseline and perform consistently across context windows and directions. Using statistical analysis, we investigate the effect of source-based (adult) and target-based (child) factors on the adaptation methods. Our results demonstrate the applicability of our modeling approach in improving child speech recognition by employing information transfer from the adult interlocutor.
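
The utterance-level interpolation mentioned in the abstract can be illustrated with a minimal sketch. It assumes add-one-smoothed unigram models and a fixed interpolation weight; the abstract does not state the n-gram order, toolkit, or weights the authors used, so the function names, toy data, and the value of `lam` below are hypothetical and are not the authors' implementation.

```python
from collections import Counter


def unigram_probs(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def interpolate(base, context, lam):
    """Linear interpolation: (1 - lam) * base + lam * context, per word."""
    return {w: (1.0 - lam) * base[w] + lam * context[w] for w in base}


# Toy data (assumption): a background child-speech corpus and one
# preceding adult utterance that serves as the linguistic context.
background = "i want juice i like the dog the dog is big".split()
adult_context = "do you want the red ball".split()
vocab = sorted(set(background) | set(adult_context))

base = unigram_probs(background, vocab)
ctx = unigram_probs(adult_context, vocab)

# lam is a hypothetical weight; in practice it would be tuned on held-out data.
adapted = interpolate(base, ctx, lam=0.3)
```

In this sketch, words the adult has just used (such as "ball") gain probability mass in the adapted model, which is the intuition behind using lexical repetition as context for the child's next utterance.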

