Michaelov James A, Bergen Benjamin K
Department of Cognitive Science, University of California San Diego.
Open Mind (Camb). 2024 Jun 28;8:859-897. doi: 10.1162/opmi_a_00150. eCollection 2024.
Accounts of human language comprehension propose different mathematical relationships between the contextual probability of a word and how difficult it is to process, including linear, logarithmic, and super-logarithmic ones. However, the empirical evidence favoring any of these over the others is mixed, appearing to vary depending on the index of processing difficulty used and the approach taken to calculate contextual probability. To help disentangle these results, we focus on the mathematical relationship between corpus-derived contextual probability and the N400, a neural index of processing difficulty. Specifically, we use 37 contemporary transformer language models to calculate the contextual probability of stimuli from 6 experimental studies of the N400, and test whether N400 amplitude is best predicted by a linear, logarithmic, super-logarithmic, or sub-logarithmic transformation of the probabilities calculated using these language models, as well as by combinations of these transformed metrics. We replicate the finding that on some datasets, a combination of linearly and logarithmically transformed probability can predict N400 amplitude better than either metric alone. In addition, we find that overall, the best single predictor of N400 amplitude is sub-logarithmically transformed probability, which for almost all language models and datasets explains all the variance in N400 amplitude otherwise explained by the linear and logarithmic transformations. This is a novel finding that is not predicted by any current theoretical account, and thus one that we argue is likely to play an important role in increasing our understanding of how the statistical regularities of language impact language comprehension.
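The abstract's model-comparison logic can be illustrated with a minimal sketch (this is not the authors' code): regress an N400-like response on differently transformed word probabilities and compare fits. Here the probabilities, the amplitude data, and the choice of a power transform `p**alpha` as one possible parameterization of a "sub-logarithmic" transformation are all illustrative assumptions.

```python
# Illustrative sketch, not the study's actual analysis: compare how well
# linear, logarithmic, and a hypothetical sub-logarithmic transformation of
# word probability predict a synthetic N400-like amplitude using simple
# one-predictor ordinary least squares.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic contextual probabilities for 200 "words" (stand-ins for the
# language-model estimates; the real study used 37 transformer LMs).
p = rng.uniform(1e-6, 1.0, size=200)

def transforms(p, alpha=0.3):
    # "Sub-logarithmic" is modeled here as a power transform p**alpha with
    # small alpha -- one hypothetical parameterization, chosen only so the
    # curve grows more slowly than log near 0; the paper may define it
    # differently.
    return {
        "linear": p,
        "logarithmic": np.log(p),       # i.e., negative surprisal
        "sub-logarithmic": p ** alpha,
    }

# Simulate an amplitude driven by the sub-log transform plus noise.
y = -2.0 * (p ** 0.3) + rng.normal(0.0, 0.1, size=p.size)

def r_squared(x, y):
    """R^2 of the OLS fit y ~ 1 + x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

scores = {name: r_squared(x, y) for name, x in transforms(p).items()}
best = max(scores, key=scores.get)
print(best)  # on this synthetic data, the sub-log predictor fits best
```

Because the synthetic amplitude was generated from the power transform, the sub-logarithmic predictor wins here by construction; the point is only to show the shape of the comparison (one regression per transformed metric, fits compared on variance explained), not to reproduce the paper's result.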