Vittoria Dentella, Fritz Günther, Evelina Leivada
Department of Brain and Behavioral Sciences, University of Pavia, Pavia, Italy.
Institut für Psychologie, Humboldt-Universität zu Berlin, Berlin, Germany.
PLoS One. 2025 Jul 17;20(7):e0327794. doi: 10.1371/journal.pone.0327794. eCollection 2025.
Understanding the limits of language is a prerequisite for Large Language Models (LLMs) to act as theories of natural language. LLM performance in some language tasks differs from that of humans both quantitatively and qualitatively; however, it remains to be determined whether such differences can be overcome by increasing model size. This work investigates the critical role of model scaling, asking whether increases in size close the gap between humans and models. We test three LLMs from different families (Bard, 137 billion parameters; ChatGPT-3.5, 175 billion; ChatGPT-4, 1.5 trillion) on a grammaticality judgment task featuring anaphora, center embedding, comparatives, and negative polarity. N = 1,200 judgments are collected and scored for accuracy, stability, and improvement in accuracy upon repeated presentation of a prompt. Results of the best-performing LLM, ChatGPT-4, are compared to results of n = 80 humans on the same stimuli. We find that humans are overall less accurate than ChatGPT-4 (76% vs. 80% accuracy, respectively), but that this is because ChatGPT-4 outperforms humans in only one task condition, namely on grammatical sentences. Additionally, ChatGPT-4 wavers more than humans in its answers (12.5% vs. 9.6% likelihood of an oscillating answer, respectively). Thus, while increased model size may lead to better performance, LLMs are still not sensitive to (un)grammaticality in the way humans are. It seems possible but unlikely that scaling alone can fix this issue. We interpret these results by comparing language learning in vivo and in silico, identifying three critical differences concerning (i) the type of evidence, (ii) the poverty of the stimulus, and (iii) the occurrence of semantic hallucinations due to impenetrable linguistic reference.
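For illustration only, the sketch below shows one way the two reported metrics, accuracy and answer stability (the rate of oscillating answers across repeated presentations of the same prompt), could be computed from binary grammaticality judgments. This is not the authors' scoring code; the data layout, the `score` function, and the oscillation criterion (any mix of "grammatical"/"ungrammatical" answers for the same item) are assumptions made for the example.

```python
# Minimal sketch (assumed data layout, not the authors' code) of scoring
# repeated yes/no grammaticality judgments for accuracy and stability.
from collections import defaultdict

# Each record: (sentence_id, is_grammatical, model_answer), where model_answer
# is True if the model judged the sentence grammatical. Sentences are assumed
# to be presented several times each.
judgments = [
    ("s1", True, True), ("s1", True, True), ("s1", True, False),    # oscillates
    ("s2", False, False), ("s2", False, False), ("s2", False, False),
]

def score(judgments):
    # Accuracy: proportion of individual judgments matching the gold label.
    correct = sum(answer == gold for _, gold, answer in judgments)
    accuracy = correct / len(judgments)

    # Stability: an item "oscillates" if repeated presentations of the same
    # prompt yield mixed answers.
    per_item = defaultdict(list)
    for sentence_id, _, answer in judgments:
        per_item[sentence_id].append(answer)
    oscillating = sum(1 for answers in per_item.values() if len(set(answers)) > 1)
    oscillation_rate = oscillating / len(per_item)

    return accuracy, oscillation_rate

accuracy, oscillation_rate = score(judgments)
print(f"accuracy = {accuracy:.2%}, oscillation rate = {oscillation_rate:.2%}")
```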