Department of Brain and Cognitive Sciences, University of Rochester, Rochester, New York, USA.
Department of Psychiatry and Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Charlestown, Massachusetts, USA.
Top Cogn Sci. 2021 Apr;13(2):351-398. doi: 10.1111/tops.12529. Epub 2021 Mar 29.
A classic problem in spoken language comprehension is how listeners perceive speech as being composed of discrete words, given the variable time-course of information in continuous signals. We propose a syllable inference account of spoken word recognition and segmentation, according to which alternative hierarchical models of syllables, words, and phonemes are dynamically posited and expected to maximally predict incoming sensory input. Generative models are combined with current estimates of context speech rate drawn from neural oscillatory dynamics, which are sensitive to amplitude rises. Over time, models that result in local minima in error between predicted and recently experienced signals give rise to perceptions of hearing words. Three experiments using the visual world eye-tracking paradigm with a picture-selection task tested hypotheses motivated by this framework. Materials were sentences that were acoustically ambiguous in the numbers of syllables, words, and phonemes they contained (cf. English plural constructions, such as "saw (a) raccoon(s) swimming," which have two loci of grammatical information). Time-compressing or expanding the speech materials permitted determination of how temporal information at, or in the context of, each locus affected looks to, and selection of, pictures with a singular or plural referent (e.g., one or more than one raccoon). Supporting our account, listeners probabilistically interpreted identical chunks of speech as consistent with a singular or plural referent, to a degree based on the chunk's gradient rate in relation to its context. We interpret these results as evidence that arriving temporal information, judged against language model predictions generated from context speech rate evaluated on a continuous scale, informs inferences about syllables, thereby giving rise to the perceptual experience of understanding spoken language as words separated in time.
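To make the proposed mechanism concrete, here is a minimal, hypothetical Python sketch of the two ingredients the account combines: a context speech-rate estimate driven by amplitude rises, and prediction-error scoring of alternative syllable parses of an ambiguous chunk. All names (estimate_rate, parse_probabilities), syllable counts, durations, and the Gaussian-error/softmax scoring are illustrative assumptions, not the authors' implementation or experimental materials.

```python
import math

def estimate_rate(envelope, fs, threshold=0.5):
    """Crude context-rate estimate: count amplitude rises per second.

    A stand-in for the oscillation-based rate tracking in the account:
    each upward crossing of `threshold` in the normalized amplitude
    envelope is treated as one syllable onset.
    """
    rises = sum(1 for a, b in zip(envelope, envelope[1:])
                if a < threshold <= b)
    return rises / (len(envelope) / fs)  # syllables per second

def parse_probabilities(chunk_duration, context_rate, candidate_parses,
                        noise=0.1):
    """Score alternative syllable parses of an ambiguous chunk.

    Each candidate parse predicts a duration (syllable count / context
    rate); parses whose prediction error against the observed duration
    is smallest win probabilistically (Gaussian log-likelihood, then
    softmax normalization).
    """
    log_scores = {}
    for parse, n_syllables in candidate_parses.items():
        predicted = n_syllables / context_rate
        error = chunk_duration - predicted
        log_scores[parse] = -0.5 * (error / noise) ** 2
    m = max(log_scores.values())
    exps = {p: math.exp(s - m) for p, s in log_scores.items()}
    z = sum(exps.values())
    return {p: v / z for p, v in exps.items()}

# Toy context: a 1 s envelope (fs = 100 Hz) with five amplitude rises,
# yielding an estimated rate of 5.0 syllables per second.
envelope = [max(0.0, math.sin(math.pi * t / 10)) for t in range(100)]
print("context rate:", estimate_rate(envelope, fs=100))

# The ambiguous region of "saw (a) raccoon(s) swimming": an illustrative
# 3-syllable (plural, no article) vs. 4-syllable (singular, with article)
# parse of the same stretch of signal.
candidates = {"plural (3 syll)": 3, "singular (4 syll)": 4}

# The same 0.75 s chunk is interpreted differently as the assumed
# context rate varies -- a gradient rather than categorical shift.
for rate in (4.0, 5.0, 6.0):
    probs = parse_probabilities(0.75, rate, candidates)
    print(f"rate={rate:.1f} syll/s:",
          {p: round(v, 2) for p, v in probs.items()})
```

Run as-is, the toy shows the same 0.75 s chunk shifting gradually from a best-fitting 3-syllable parse toward a 4-syllable parse as the assumed context rate increases, mirroring the gradient, rate-relative interpretation described in the abstract; the direction of the mapping here is purely illustrative, not the reported experimental result.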