Biomedical Engineering, University of Rochester, Rochester, NY, USA.
Center for Visual Science, University of Rochester, Rochester, NY, USA.
Trends Hear. 2023 Jan-Dec;27:23312165231207235. doi: 10.1177/23312165231207235.
Audiovisual integration of speech can benefit the listener not only by improving comprehension of what a talker is saying but also by helping the listener select a particular talker's voice from a mixture of sounds. Binding, an early integration of auditory and visual streams that helps an observer allocate attention to a combined audiovisual object, is likely involved in processing audiovisual speech. Although temporal coherence of stimulus features across sensory modalities has been implicated as an important cue for non-speech stimuli (Maddox et al., 2015), the specific cues that drive binding in speech are not fully understood, owing to the challenges of studying binding in natural stimuli. Here we used speech-like artificial stimuli that allowed us to isolate three potential contributors to binding: temporal coherence (are the face and the voice changing synchronously?), articulatory correspondence (do visual faces represent the correct phones?), and talker congruence (do the face and voice come from the same person?). In a trio of experiments, we examined the relative contributions of each of these cues. Normal-hearing listeners performed a dual task in which they were instructed to respond to events in a target auditory stream while ignoring events in a distractor auditory stream (auditory discrimination) and detecting flashes in a visual stream (visual detection). We found that viewing the face of a talker who matched the attended voice (i.e., talker congruence) offered a performance benefit. We found no effect of temporal coherence on performance in this task, prompting an important recontextualization of previous findings.