Departament d'Estudis Anglesos i Alemanys, Universitat Rovira i Virgili, Tarragona 43002, Spain.
Institut für Psychologie, Humboldt-Universität zu Berlin, Berlin 10099, Germany.
Proc Natl Acad Sci U S A. 2023 Dec 19;120(51):e2309583120. doi: 10.1073/pnas.2309583120. Epub 2023 Dec 13.
Humans are universally good at providing stable and accurate judgments about what forms part of their language and what does not. Large Language Models (LMs) are claimed to possess human-like language abilities; hence, they are expected to emulate this behavior by providing both stable and accurate answers when asked whether a string of words complies with or deviates from their next-word predictions. This work tests whether stability and accuracy are showcased by GPT-3/text-davinci-002, GPT-3/text-davinci-003, and ChatGPT, using a series of judgment tasks that tap into 8 linguistic phenomena: plural attraction, anaphora, center embedding, comparatives, intrusive resumption, negative polarity items, order of adjectives, and order of adverbs. For every phenomenon, 10 sentences (5 grammatical and 5 ungrammatical) are tested, each randomly repeated 10 times, totaling 800 elicited judgments per LM (total n = 2,400). Our results reveal variable above-chance accuracy in the grammatical condition, below-chance accuracy in the ungrammatical condition, significant instability of answers across phenomena, and a yes-response bias for all the tested LMs. Furthermore, we found no evidence that repetition helps the models converge on a processing strategy that yields stable answers, whether accurate or inaccurate. We demonstrate that the LMs' performance in identifying (un)grammatical word patterns stands in stark contrast to what is observed in humans (n = 80, tested on the same tasks) and argue that adopting LMs as theories of human language is not justified at their current stage of development.
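To make the counts in the design concrete (8 phenomena × 2 conditions × 5 sentences × 10 repetitions = 800 judgments per LM), the following is a minimal Python sketch of an elicited-judgment loop under those parameters. The phenomenon labels and counts come from the abstract; the stimuli, the ask_model() stub, and all other identifiers are hypothetical placeholders, not the authors' materials or code.

```python
# Illustrative sketch of the judgment-elicitation design described in the abstract.
# ask_model() is a stand-in for a yes/no acceptability query to an LM.
import random
from collections import defaultdict

PHENOMENA = [
    "plural attraction", "anaphora", "center embedding", "comparatives",
    "intrusive resumption", "negative polarity items",
    "order of adjectives", "order of adverbs",
]
SENTENCES_PER_CONDITION = 5   # 5 grammatical + 5 ungrammatical per phenomenon
REPETITIONS = 10              # each sentence queried 10 times

def ask_model(sentence: str, grammatical: bool) -> bool:
    """Hypothetical placeholder: returns a noisy, 'yes'-biased guess."""
    return random.random() < (0.8 if grammatical else 0.6)

# (phenomenon, condition) -> list of booleans marking correct answers
judgments = defaultdict(list)
for phenomenon in PHENOMENA:
    for grammatical in (True, False):
        for item in range(SENTENCES_PER_CONDITION):
            sentence = f"[{phenomenon} item {item}]"  # stand-in for a real stimulus
            for _ in range(REPETITIONS):
                says_yes = ask_model(sentence, grammatical)
                judgments[(phenomenon, grammatical)].append(says_yes == grammatical)

total = sum(len(v) for v in judgments.values())
assert total == len(PHENOMENA) * 2 * SENTENCES_PER_CONDITION * REPETITIONS == 800

for grammatical in (True, False):
    hits = [c for (p, g), v in judgments.items() if g == grammatical for c in v]
    label = "grammatical" if grammatical else "ungrammatical"
    print(f"{label}: accuracy = {sum(hits) / len(hits):.2f} (n = {len(hits)})")
```

With three LMs queried this way, the grand total is 3 × 800 = 2,400 elicited judgments, matching the n reported in the abstract.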