Uchanski R M, Delhorne L A, Dix A K, Braida L D, Reed C M, Durlach N I
Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA 02139.
J Rehabil Res Dev. 1994;31(1):20-41.
Although great strides have been made in the development of automatic speech recognition (ASR) systems, the communication performance achievable with the output of current real-time speech recognition systems would be extremely poor relative to normal speech reception. An alternative application of ASR technology to aid the hearing impaired would derive cues from the acoustical speech signal that could be used to supplement speechreading. We report a study of highly trained receivers of Manual Cued Speech that indicates that nearly perfect reception of everyday connected speech materials can be achieved at near-normal speaking rates. To understand the accuracy that might be achieved with automatically generated cues, we measured how well trained spectrogram readers and an automatic speech recognizer could assign cues for various cue systems. We then applied a recently developed model of audiovisual integration to these recognizer measurements and to data on human recognition of consonant and vowel segments via speechreading to evaluate the benefit to speechreading provided by such cues. Our analysis suggests that with cues derived from current recognizers, consonant and vowel segments can be received with accuracies in excess of 80%. This level of performance is roughly equivalent to the segment reception accuracy required to account for observed levels of Manual Cued Speech reception. Current recognizers provide maximal benefit by generating only a relatively small number (three to five) of cue groups, and may not provide substantially greater aid to speechreading than simpler aids that do not incorporate discrete phonetic recognition. To provide guidance for the development of improved automatic cueing systems, we describe techniques for determining optimum cue groups for a given recognizer and speechreader, and estimate the cueing performance that might be achieved if the performance of current recognizers were improved.
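The core idea of supplementing speechreading with a small number of cue groups can be illustrated with a toy calculation. The sketch below is not the authors' integration model; it uses a deliberately idealized assumption that a segment is received correctly only if the intersection of its viseme group (what speechreading alone can distinguish) and its assigned cue group contains that segment alone, discounted by the probability `p_cue` that the recognizer assigns the cue correctly. The viseme and cue groupings shown are hypothetical examples, not the groups studied in the paper.

```python
# Toy estimate of segment reception accuracy for speechreading supplemented
# by automatically generated cues. Hypothetical groupings for illustration.

# Viseme groups: segments that look alike on the lips (toy example).
visemes = [{"p", "b", "m"}, {"f", "v"}, {"t", "d", "s", "z", "n"}]

# Cue groups: segments assigned the same cue by the recognizer (toy example).
cues = [{"p", "t", "f"}, {"b", "d", "v"}, {"m", "s"}, {"z", "n"}]


def accuracy(visemes, cues, p_cue=0.8):
    """Fraction of segments received correctly under the idealized rule:
    correct iff viseme-group/cue-group intersection is a singleton,
    and the cue itself was assigned correctly (probability p_cue)."""
    segments = set().union(*visemes)
    correct = 0.0
    for s in segments:
        v = next(g for g in visemes if s in g)  # what the eye narrows it to
        c = next(g for g in cues if s in g)     # what the cue narrows it to
        if len(v & c) == 1:                     # jointly unique -> received
            correct += p_cue
    return correct / len(segments)


print(accuracy(visemes, cues))  # 8 of 10 segments uniquely identified
```

With these toy groups, /z/ and /n/ share both a viseme group and a cue group, so they remain confusable; every other segment is uniquely identified when the cue is correct, giving 8 × 0.8 / 10 = 0.64. This illustrates why a handful of well-chosen cue groups can sharply raise reception accuracy, and why the choice of partition matters as much as the number of groups.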