A Deep Neural Network Trained on Congruent Audiovisual Speech Reports the McGurk Effect.

Authors

Ma Haotian, Wang Zhengjia, Zhang Xiang, Magnotti John F, Beauchamp Michael S

Affiliations

Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA.

Publication

bioRxiv. 2025 Aug 24:2025.08.20.671347. doi: 10.1101/2025.08.20.671347.

DOI: 10.1101/2025.08.20.671347
PMID: 40894527
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12393562/
Abstract

In the McGurk effect, incongruent auditory and visual syllables are perceived as a third, illusory syllable. The prevailing explanation for the effect is that the illusory syllable is a consensus percept intermediate between otherwise incompatible auditory and visual representations. To test this idea, we turned to a deep neural network known as AVHuBERT that transcribes audiovisual speech with high accuracy. Critically, AVHuBERT was trained only with audiovisual speech, without exposure to McGurk stimuli or other incongruent speech. In the current study, when tested with congruent audiovisual "ba", "ga" and "da" syllables recorded from 8 different talkers, AVHuBERT transcribed them with near-perfect accuracy, and showed a human-like pattern of highest accuracy for audiovisual speech, slightly lower accuracy for auditory-only speech, and low accuracy for visual-only speech. When presented with incongruent McGurk syllables (auditory "ba" paired with visual "ga"), AVHuBERT reported the McGurk fusion percept of "da" at a rate of 25%, many-fold greater than the rate for either auditory or visual components of the McGurk stimulus presented on their own. To examine the individual variability that is a hallmark of human perception of the McGurk effect, 100 variants of AVHuBERT were constructed. Like human observers, AVHuBERT variants were consistently accurate for congruent syllables but highly variable for McGurk syllables. Similarities between the responses of AVHuBERT and humans to congruent and incongruent audiovisual speech, including the McGurk effect, suggest that DNNs may be a useful tool for interrogating the perceptual and neural mechanisms of human audiovisual speech perception.
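The fusion-rate measure described in the abstract can be sketched as a simple tally over trial-level transcriptions. The function name and the example responses below are hypothetical illustrations, not actual AVHuBERT output or the authors' analysis code:

```python
def fusion_rate(responses, fusion_syllable="da"):
    """Fraction of McGurk trials transcribed as the fusion percept.

    responses: list of syllable transcriptions, one per trial
    fusion_syllable: the illusory percept for auditory "ba" + visual "ga"
    """
    return sum(r == fusion_syllable for r in responses) / len(responses)

# Hypothetical transcriptions for 8 McGurk trials (auditory "ba" + visual "ga"):
# 2 fusion responses out of 8 would match the 25% rate reported in the abstract.
mcgurk_responses = ["ba", "da", "ba", "ba", "da", "ba", "ba", "ba"]
print(fusion_rate(mcgurk_responses))  # 0.25
```

The same tally applied separately to auditory-only and visual-only presentations would give the baseline rates that the 25% fusion rate is compared against.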

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8733/12393562/672fdbfc54be/nihpp-2025.08.20.671347v1-f0001.jpg

Similar Articles

1. A Deep Neural Network Trained on Congruent Audiovisual Speech Reports the McGurk Effect.
bioRxiv. 2025 Aug 24:2025.08.20.671347. doi: 10.1101/2025.08.20.671347.
2. Evidence for a Causal Dissociation of the McGurk Effect and Congruent Audiovisual Speech Perception via TMS to the Left pSTS.
Multisens Res. 2024 Aug 16;37(4-5):341-363. doi: 10.1163/22134808-bja10129.
3. The noisy encoding of disparity model predicts perception of the McGurk effect in native Japanese speakers.
Front Neurosci. 2024 Jun 26;18:1421713. doi: 10.3389/fnins.2024.1421713. eCollection 2024.
4. The McGurk effect is similar in native Mandarin Chinese and American English speakers.
Front Psychol. 2025 Mar 28;16:1531566. doi: 10.3389/fpsyg.2025.1531566. eCollection 2025.
5. Prescription of Controlled Substances: Benefits and Risks.
6. Interventions for childhood apraxia of speech.
Cochrane Database Syst Rev. 2018 May 30;5(5):CD006278. doi: 10.1002/14651858.CD006278.pub3.
7. The agreement of phonetic transcriptions between paediatric speech and language therapists transcribing a disordered speech sample.
Int J Lang Commun Disord. 2024 Sep-Oct;59(5):1981-1995. doi: 10.1111/1460-6984.13043. Epub 2024 Jun 8.
8. Repeatedly experiencing the McGurk effect induces long-lasting changes in auditory speech perception.
Commun Psychol. 2024 Apr 3;2(1):25. doi: 10.1038/s44271-024-00073-w.
9. Seeing a Talker's Mouth Reduces the Effort of Perceiving Speech and Repairing Perceptual Mistakes for Listeners With Cochlear Implants.
Ear Hear. 2025 Jun 16. doi: 10.1097/AUD.0000000000001683.
10. Falls prevention interventions for community-dwelling older adults: systematic review and meta-analysis of benefits, harms, and patient values and preferences.
Syst Rev. 2024 Nov 26;13(1):289. doi: 10.1186/s13643-024-02681-3.

References Cited in This Article

1. Variations in unisensory speech perception explain interindividual differences in McGurk illusion susceptibility.
Psychon Bull Rev. 2025 Apr 24. doi: 10.3758/s13423-025-02697-3.
2. The McGurk effect is similar in native Mandarin Chinese and American English speakers.
Front Psychol. 2025 Mar 28;16:1531566. doi: 10.3389/fpsyg.2025.1531566. eCollection 2025.
3. Multisensory integration operates on correlated input from unimodal transient channels.
Elife. 2025 Jan 22;12:RP90841. doi: 10.7554/eLife.90841.
4. Models optimized for real-world tasks reveal the task-dependent necessity of precise temporal coding in hearing.
Nat Commun. 2024 Dec 4;15(1):10590. doi: 10.1038/s41467-024-54700-5.
5. The noisy encoding of disparity model predicts perception of the McGurk effect in native Japanese speakers.
Front Neurosci. 2024 Jun 26;18:1421713. doi: 10.3389/fnins.2024.1421713. eCollection 2024.
6. Shared functional specialization in transformer-based language models and the human brain.
Nat Commun. 2024 Jun 29;15(1):5523. doi: 10.1038/s41467-024-49173-5.
7. Synthetic faces generated with the facial action coding system or deep neural networks improve speech-in-noise perception, but not as much as real faces.
Front Neurosci. 2024 May 9;18:1379988. doi: 10.3389/fnins.2024.1379988. eCollection 2024.
8. Artificial Neural Network Language Models Predict Human Brain Responses to Language Even After a Developmentally Realistic Amount of Training.
Neurobiol Lang (Camb). 2024 Apr 1;5(1):43-63. doi: 10.1162/nol_a_00137. eCollection 2024.
9. Dissecting neural computations in the human auditory pathway using deep neural networks for speech.
Nat Neurosci. 2023 Dec;26(12):2213-2225. doi: 10.1038/s41593-023-01468-4. Epub 2023 Oct 30.
10. How multisensory neurons solve causal inference.
Proc Natl Acad Sci U S A. 2021 Aug 10;118(32). doi: 10.1073/pnas.2106235118.