Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany.
J Neurosci. 2011 Aug 3;31(31):11338-50. doi: 10.1523/JNEUROSCI.6510-10.2011.
Face-to-face communication challenges the human brain to integrate information from auditory and visual senses with linguistic representations. Yet the roles of bottom-up physical input (spectrotemporal structure) and top-down linguistic constraints in shaping the neural mechanisms specialized for integrating audiovisual speech signals are currently unknown. Participants were presented with speech and sinewave speech analogs in visual, auditory, and audiovisual modalities. Before the fMRI study, they were trained to perceive physically identical sinewave speech analogs as speech (SWS-S) or nonspeech (SWS-N). Comparing audiovisual integration (interactions) of speech, SWS-S, and SWS-N revealed a posterior-anterior processing gradient within the left superior temporal sulcus/gyrus (STS/STG): bilateral posterior STS/STG integrated audiovisual inputs regardless of spectrotemporal structure or speech percept; in left mid-STS, the integration profile was primarily determined by the spectrotemporal structure of the signals; more anterior STS regions discarded spectrotemporal structure and integrated audiovisual signals constrained by stimulus intelligibility and the availability of linguistic representations. In addition to this "ventral" processing stream, a "dorsal" circuitry encompassing posterior STS/STG and left inferior frontal gyrus differentially integrated audiovisual speech and SWS signals. Indeed, dynamic causal modeling and Bayesian model comparison provided strong evidence for a parallel processing structure encompassing a ventral and a dorsal stream, with speech intelligibility training enhancing the connectivity between posterior and anterior STS/STG. In conclusion, audiovisual speech comprehension emerges in an interactive process, with the integration of auditory and visual signals being progressively constrained by stimulus intelligibility along the STS and by spectrotemporal structure in a dorsal frontotemporal circuitry.
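The "integration (interactions)" criterion above refers to the standard test for multisensory integration in fMRI: the response to the bimodal condition is compared against the sum of the unimodal responses. A minimal formulation of this test is given below (this is one common operationalization; the exact contrast used in the study may differ):

```latex
% Interaction test for audiovisual integration.
% beta_AV, beta_A, beta_V denote the estimated BOLD responses to the
% audiovisual, auditory-only, and visual-only conditions, respectively.
\[
  \underbrace{\beta_{AV} - (\beta_{A} + \beta_{V})}_{\text{interaction term}} \neq 0,
  \qquad
  \begin{cases}
    > 0 & \text{superadditive integration} \\
    < 0 & \text{subadditive integration}
  \end{cases}
\]
```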
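The "Bayesian model comparison" mentioned in the abstract scores candidate connectivity architectures (e.g., serial vs. parallel ventral/dorsal streams) by their model evidence. Below is a minimal Python sketch of fixed-effects Bayesian model selection; the model names and log-evidence values are hypothetical placeholders for illustration, not the paper's actual models or results.

```python
import numpy as np

# Hypothetical log model evidences (one per subject per candidate DCM).
# Values are illustrative placeholders only.
log_evidence = {
    "serial_ventral_only":     np.array([-1205.3, -1190.1, -1211.8]),
    "serial_dorsal_only":      np.array([-1210.7, -1196.4, -1215.2]),
    "parallel_ventral_dorsal": np.array([-1198.9, -1183.6, -1204.0]),
}

# Fixed-effects assumption: the same model generated every subject's data,
# so log evidences simply sum across subjects.
models = list(log_evidence)
group_le = np.array([log_evidence[m].sum() for m in models])

# Posterior model probabilities under a flat prior over models:
# p(m | y) is proportional to exp(log p(y | m)); subtracting the maximum
# log evidence before exponentiating keeps the computation stable.
post = np.exp(group_le - group_le.max())
post /= post.sum()

for m, p in zip(models, post):
    print(f"{m}: posterior probability = {p:.3f}")

# A log group Bayes factor above ~3 between the best and next-best model
# is conventionally interpreted as "strong evidence".
top, runner_up = np.sort(group_le)[::-1][:2]
print(f"log group Bayes factor (best vs. next best): {top - runner_up:.1f}")
```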