

[Intermodal timing cues for audio-visual speech recognition].

Author Information

Hashimoto Masahiro, Kumashiro Masaharu

Affiliation

Bio-information Research Center, University of Occupational and Environmental Health, Yahatanishi-ku, Kitakyushu 807-8555, Japan.

Publication Information

J UOEH. 2004 Jun 1;26(2):215-25. doi: 10.7888/juoeh.26.215.

Abstract

The purpose of this study was to investigate the limits of the lip-reading advantage for young Japanese adults by desynchronizing the visual and auditory components of speech. In the experiment, audio-visual speech stimuli were presented under six test conditions: audio-alone, and audio-visual with an audio delay of 0, 60, 120, 240 or 480 ms. The stimuli were video recordings of the face of a female Japanese speaker producing long and short Japanese sentences. The intelligibility of the audio-visual stimuli was measured as a function of audio delay in sixteen untrained young subjects. Speech intelligibility under audio-delay conditions of less than 120 ms was significantly better than under the audio-alone condition. Notably, the 120 ms delay corresponded to the mean mora duration measured for the audio stimuli. The results imply that audio delays of up to 120 ms do not disrupt the lip-reading advantage, because the visual and auditory components of speech appear to be integrated on a syllabic time scale. Potential applications of this research include noisy workplaces in which a worker must extract relevant speech from competing noise.
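The delay manipulation described above amounts to shifting the audio track later in time relative to the video by a fixed offset. A minimal sketch of how such an offset could be imposed on a sampled audio signal (a hypothetical helper for illustration, not code from the paper):

```python
# Condition offsets used in the study, in milliseconds.
AUDIO_DELAYS_MS = [0, 60, 120, 240, 480]

def delay_audio(samples, delay_ms, sample_rate=44100):
    """Return a copy of `samples` delayed by `delay_ms` milliseconds.

    The delay is realized by prepending silence (zeros), so the audio
    onset lags the (unchanged) video track by the given offset.
    """
    n_silence = int(round(delay_ms / 1000.0 * sample_rate))
    return [0.0] * n_silence + list(samples)

# Example: at a 1 kHz sample rate, a 120 ms delay prepends 120 zero samples.
delayed = delay_audio([0.5] * 100, 120, sample_rate=1000)
```

At a typical Japanese mora duration of roughly 120 ms, this largest "tolerated" offset in the study corresponds to a shift of about one mora.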


Similar Articles

1. [Intermodal timing cues for audio-visual speech recognition]. J UOEH. 2004 Jun 1;26(2):215-25. doi: 10.7888/juoeh.26.215.
2. Intermodal timing relations and audio-visual speech recognition by normal-hearing adults. J Acoust Soc Am. 1985 Feb;77(2):678-85. doi: 10.1121/1.392336.
3. Visual speech influences speech perception immediately but not automatically. Atten Percept Psychophys. 2017 Feb;79(2):660-678. doi: 10.3758/s13414-016-1249-6.
4. Differential Auditory and Visual Phase-Locking Are Observed during Audio-Visual Benefit and Silent Lip-Reading for Speech Perception. J Neurosci. 2022 Aug 3;42(31):6108-6120. doi: 10.1523/JNEUROSCI.2476-21.2022. Epub 2022 Jun 27.
5. Seeing to hear better: evidence for early audio-visual interactions in speech identification. Cognition. 2004 Sep;93(2):B69-78. doi: 10.1016/j.cognition.2004.01.006.
6. Degradation of labial information modifies audiovisual speech perception in cochlear-implanted children. Ear Hear. 2013 Jan-Feb;34(1):110-21. doi: 10.1097/AUD.0b013e3182670993.
7. The use of visible speech cues for improving auditory detection of spoken sentences. J Acoust Soc Am. 2000 Sep;108(3 Pt 1):1197-208. doi: 10.1121/1.1288668.
8. Congruent audiovisual speech enhances auditory attention decoding with EEG. J Neural Eng. 2019 Nov 6;16(6):066033. doi: 10.1088/1741-2552/ab4340.
