Zhang Yue, Magnotti John F, Zhang Xiang, Wang Zhengjia, Yu Yingjia, Davis Kathryn A, Sheth Sameer A, Chen H Isaac, Yoshor Daniel, Beauchamp Michael S
Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA.
Department of Neurosurgery, Baylor College of Medicine, Houston, TX.
J Neurosci. 2025 Sep 10. doi: 10.1523/JNEUROSCI.1037-25.2025.
Human speech perception is multisensory, integrating auditory information from the talker's voice with visual information from the talker's face. BOLD fMRI studies have implicated the superior temporal gyrus (STG) in processing auditory speech and the superior temporal sulcus (STS) in integrating auditory and visual speech, but as an indirect hemodynamic measure, fMRI is limited in its ability to track the rapid neural computations underlying speech perception. Using stereoelectroencephalography (sEEG) electrodes, we directly recorded from the STG and STS in 42 epilepsy patients (25 F, 17 M). Participants identified single words presented in auditory, visual, and audiovisual formats, with and without added auditory noise. Seeing the talker's face provided a strong perceptual benefit, improving perception of noisy speech in every participant. Neurally, a subpopulation of electrodes concentrated in mid-posterior STG and STS responded to both auditory speech (latency 71 ms) and visual speech (109 ms). Significant multisensory enhancement was observed, especially in the upper bank of the STS: compared with auditory-only speech, the response latency for audiovisual speech was 40% faster and the response amplitude was 18% larger. In contrast, the STG showed neither faster nor larger multisensory responses. Surprisingly, STS response latencies for audiovisual speech were significantly faster than those in the STG (50 ms vs. 64 ms), suggesting a parallel pathway model in which the STG plays the primary role in auditory-only speech perception, while the STS takes the lead in audiovisual speech perception. Together with fMRI, sEEG provides converging evidence that the STS plays a key role in multisensory integration.

One of the most important functions of the human brain is to communicate with others. During conversation, humans take advantage of visual information from the face of the talker as well as auditory information from the voice of the talker.
We directly recorded activity from the brains of epilepsy patients implanted with electrodes in the superior temporal sulcus (STS), a key brain region for speech perception. These recordings showed that hearing the voice and seeing the face of the talker evoked larger and faster neural responses in the STS than the talker's voice alone. Multisensory enhancement in the STS may be the neural basis for our ability to better understand noisy speech when we can see the face of the talker.