Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology.
McGovern Institute for Brain Research, Massachusetts Institute of Technology.
Cogn Sci. 2023 Nov;47(11):e13386. doi: 10.1111/cogs.13386.
Word co-occurrence patterns in language corpora contain a surprising amount of conceptual knowledge. Large language models (LLMs), trained to predict words in context, leverage these patterns to achieve impressive performance on diverse semantic tasks requiring world knowledge. An important but understudied question about LLMs' semantic abilities is whether they acquire generalized knowledge of common events. Here, we test whether five pretrained LLMs (from 2018's BERT to 2023's MPT) assign a higher likelihood to plausible descriptions of agent-patient interactions than to minimally different implausible versions of the same event. Using three curated sets of minimal sentence pairs (total n = 1215), we found that pretrained LLMs possess substantial event knowledge, outperforming other distributional language models. In particular, they almost always assign a higher likelihood to possible versus impossible events (The teacher bought the laptop vs. The laptop bought the teacher). However, LLMs show less consistent preferences for likely versus unlikely events (The nanny tutored the boy vs. The boy tutored the nanny). In follow-up analyses, we show that (i) LLM scores are driven by both plausibility and surface-level sentence features, (ii) LLM scores generalize well across syntactic variants (active vs. passive constructions) but less well across semantic variants (synonymous sentences), (iii) some LLM errors mirror human judgment ambiguity, and (iv) sentence plausibility serves as an organizing dimension in internal LLM representations. Overall, our results show that important aspects of event knowledge naturally emerge from distributional linguistic patterns, but also highlight a gap between representations of possible/impossible and likely/unlikely events.
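The core measurement described above can be illustrated with a minimal sketch: score both members of a minimal sentence pair under a pretrained causal language model and check whether the plausible version receives the higher likelihood. This is not the authors' exact pipeline; the model choice (gpt2 via HuggingFace transformers) and the summed-log-probability score are illustrative assumptions.

```python
# Minimal sketch of scoring a plausible/implausible minimal pair with a
# pretrained causal LM. "gpt2" stands in for the models tested in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_log_likelihood(sentence: str) -> float:
    """Sum of token log-probabilities under the LM (higher = more likely)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # cross-entropy over predicted tokens; multiplying by the number
        # of predictions recovers the summed negative log-likelihood.
        outputs = model(**inputs, labels=inputs["input_ids"])
    n_predictions = inputs["input_ids"].shape[1] - 1  # no prediction for the first token
    return -outputs.loss.item() * n_predictions

plausible = "The teacher bought the laptop."
implausible = "The laptop bought the teacher."

ll_plausible = sentence_log_likelihood(plausible)
ll_implausible = sentence_log_likelihood(implausible)
print(f"{plausible} -> {ll_plausible:.2f}")
print(f"{implausible} -> {ll_implausible:.2f}")
print("Correct preference:", ll_plausible > ll_implausible)
```

Aggregating this pairwise comparison over a curated set of minimal pairs yields the accuracy-style measure the abstract reports; in practice the score would also need a masked-LM variant for bidirectional models such as BERT.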