Wellcome Trust Centre for Neuroimaging, University College London, London WC1N 3BG, United Kingdom.
J Neurosci. 2010 Jan 13;30(2):629-38. doi: 10.1523/JNEUROSCI.2742-09.2010.
We understand speech from different speakers with ease, whereas artificial speech recognition systems struggle with this task. It is unclear how the human brain solves this problem. The conventional view is that speech message recognition and speaker identification are two separate functions, with message processing taking place predominantly in the left hemisphere and processing of speaker-specific information located in the right hemisphere. Here, we distinguish the contributions of specific cortical regions to speech recognition and speaker information processing by controlled manipulation of task and resynthesized speaker parameters. Two functional magnetic resonance imaging studies provide evidence for a dynamic speech-processing network that questions the conventional view. We found that speech recognition regions in left posterior superior temporal gyrus/superior temporal sulcus (STG/STS) encode, along with the speech message, speaker-related vocal tract parameters, which are reflected in the amplitude peaks of the speech spectrum. Right posterior STG/STS responded more strongly to a speaker-related vocal tract parameter change during a speech recognition task than during a voice recognition task. Left and right posterior STG/STS were functionally connected. Additionally, we found that speaker-related glottal fold parameters (e.g., pitch), which are not reflected in the amplitude peaks of the speech spectrum, are processed in areas immediately adjacent to primary auditory cortex, i.e., in areas earlier in the auditory hierarchy than STG/STS. Our results point to a network account of speech recognition, in which information about the speech message and the speaker's vocal tract are combined to solve the difficult task of understanding speech from different speakers.
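The abstract distinguishes two classes of speaker parameters: vocal tract parameters, which show up as amplitude peaks (formants) in the spectral envelope, and glottal fold parameters such as pitch (F0), which do not. The following minimal Python sketch, not taken from the study, illustrates that distinction on a crude synthetic vowel; the sampling rate, pitch, formant frequencies, and estimation methods (autocorrelation for F0, smoothed-spectrum peak picking for the envelope) are all illustrative assumptions, not the resynthesis procedure used in the paper.

```python
# Illustrative sketch: glottal-source (pitch/F0) vs. vocal-tract (spectral amplitude peak)
# parameters in a speech-like signal. All values and methods are assumptions for illustration.
import numpy as np
from scipy import signal

fs = 16000                    # sampling rate (Hz), assumed
f0 = 120.0                    # glottal fold rate (pitch), a glottal source parameter
formants = [700.0, 1200.0]    # vocal tract resonances -> amplitude peaks of the spectrum

# Synthesize a crude vowel: impulse train (glottal source) through resonators (vocal tract).
n = int(0.5 * fs)
source = np.zeros(n)
source[::int(fs / f0)] = 1.0            # one glottal pulse every 1/f0 seconds
x = source
for fc in formants:                     # cascade of second-order resonators as a toy vocal tract
    r = 0.98
    a = [1.0, -2 * r * np.cos(2 * np.pi * fc / fs), r ** 2]
    x = signal.lfilter([1.0], a, x)

# Pitch estimate (glottal fold parameter): location of the autocorrelation peak.
ac = np.correlate(x, x, mode="full")[n - 1:]
lag_min, lag_max = int(fs / 400), int(fs / 60)    # search the 60-400 Hz range
lag = lag_min + np.argmax(ac[lag_min:lag_max])
print("estimated F0:", fs / lag, "Hz")

# Vocal tract estimate: strongest peaks of the smoothed amplitude spectrum (spectral envelope).
freqs, psd = signal.welch(x, fs=fs, nperseg=1024)
envelope = np.convolve(psd, np.ones(9) / 9, mode="same")   # crude envelope smoothing
peaks, _ = signal.find_peaks(envelope, distance=20)
strongest = peaks[np.argsort(envelope[peaks])[::-1][:2]]
print("spectral amplitude peaks near:", np.sort(freqs[strongest]), "Hz")
```

The point of the sketch is only that pitch is recoverable from the temporal fine structure (pulse spacing) independently of where the spectral envelope peaks lie, whereas the formant estimates track the resonator settings regardless of F0, mirroring the paper's separation of glottal fold and vocal tract parameters.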