Alexandra Jesse
Department of Psychological and Brain Sciences, University of Massachusetts, 135 Hicks Way, Amherst, MA, 01003, USA.
Atten Percept Psychophys. 2025 Apr;87(3):936-951. doi: 10.3758/s13414-025-03049-y. Epub 2025 Mar 20.
How speech is realized varies across talkers but can be somewhat consistent within a talker. Humans are sensitive to these idiosyncrasies when perceiving auditory speech, but also, in face-to-face communication, when perceiving visual speech. Our recent work has shown that humans can also use the talker idiosyncrasies seen in how talkers produce sentences to rapidly learn to recognize unfamiliar talkers, suggesting that visual speech information can be used for both speech perception and talker recognition. However, in learning from sentences, learners may focus only on global information about the talker, such as talker-specific realizations of prosody and rate. The present study tested whether human perceivers can learn the identity of a talker based solely on fine-phonetic detail in the dynamic realization of visual speech. Participants learned to identify talkers from point-light displays showing them uttering isolated words. These point-light displays isolated the dynamic speech information while discarding static information about the talker's face. No sound was presented. Feedback was given only during training. The test included point-light displays of familiar words from training and of novel words. Participants learned to recognize sets of two and four talkers from the word-level dynamics of visual speech with very little exposure. The established representations allowed talker recognition independent of linguistic content; that is, talkers were recognized even from novel words. Spoken words therefore contain sufficient indexical information in their fine-phonetic detail for perceivers to acquire dynamic facial representations of unfamiliar talkers that allow generalization across words. Dynamic representations of talking faces are formed for the recognition of unfamiliar faces.