Ma Haotian, Wang Zhengjia, Zhang Xiang, Magnotti John F, Beauchamp Michael S
Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA.
bioRxiv. 2025 Aug 24:2025.08.20.671347. doi: 10.1101/2025.08.20.671347.
In the McGurk effect, incongruent auditory and visual syllables are perceived as a third, illusory syllable. The prevailing explanation for the effect is that the illusory syllable is a consensus percept intermediate between otherwise incompatible auditory and visual representations. To test this idea, we turned to a deep neural network (DNN) known as AVHuBERT that transcribes audiovisual speech with high accuracy. Critically, AVHuBERT was trained only on audiovisual speech, without exposure to McGurk stimuli or other incongruent speech. In the current study, when tested with congruent audiovisual "ba", "ga" and "da" syllables recorded from 8 different talkers, AVHuBERT transcribed them with near-perfect accuracy and showed a human-like pattern of highest accuracy for audiovisual speech, slightly lower accuracy for auditory-only speech, and low accuracy for visual-only speech. When presented with incongruent McGurk syllables (auditory "ba" paired with visual "ga"), AVHuBERT reported the McGurk fusion percept of "da" at a rate of 25%, many-fold greater than the rate for either the auditory or the visual component of the McGurk stimulus presented on its own. To examine the individual variability that is a hallmark of human perception of the McGurk effect, 100 variants of AVHuBERT were constructed. Like human observers, the AVHuBERT variants were consistently accurate for congruent syllables but highly variable for McGurk syllables. The similarities between the responses of AVHuBERT and humans to congruent and incongruent audiovisual speech, including the McGurk effect, suggest that DNNs may be a useful tool for interrogating the perceptual and neural mechanisms of human audiovisual speech perception.
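To make the fusion-rate measure concrete, the short Python sketch below tallies how often a set of syllable transcriptions matches the illusory fusion percept. This is a minimal illustration, not the authors' analysis code; the example responses are hypothetical placeholders, not data from the study.

    from collections import Counter

    def fusion_rate(transcriptions, fusion_syllable="da"):
        # Fraction of trials whose transcription matches the illusory
        # fusion syllable ("da" for auditory "ba" paired with visual "ga").
        counts = Counter(t.strip().lower() for t in transcriptions)
        return counts[fusion_syllable] / len(transcriptions)

    # Hypothetical transcriptions of McGurk trials pooled across talkers;
    # a 25% fusion rate corresponds to "da" on 1 in every 4 trials.
    mcgurk_trials = ["ba", "da", "ba", "ba", "ba", "da", "ba", "ba"]
    print(f"Fusion rate: {fusion_rate(mcgurk_trials):.0%}")  # Fusion rate: 25%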