Departament d'Estudis Anglesos i Alemanys, Universitat Rovira i Virgili, Tarragona 43002, Spain.
Institut für Psychologie, Humboldt-Universität zu Berlin, Berlin 10099, Germany.
Proc Natl Acad Sci U S A. 2023 Dec 19;120(51):e2309583120. doi: 10.1073/pnas.2309583120. Epub 2023 Dec 13.
Humans are universally good at providing stable and accurate judgments about what forms part of their language and what does not. Large Language Models (LMs) are claimed to possess human-like language abilities; hence, they are expected to emulate this behavior by providing both stable and accurate answers when asked whether a string of words complies with or deviates from their next-word predictions. This work tests whether stability and accuracy are showcased by GPT-3/text-davinci-002, GPT-3/text-davinci-003, and ChatGPT, using a series of judgment tasks that tap into 8 linguistic phenomena: plural attraction, anaphora, center embedding, comparatives, intrusive resumption, negative polarity items, order of adjectives, and order of adverbs. For every phenomenon, 10 sentences (5 grammatical and 5 ungrammatical) are tested, each randomly repeated 10 times, totaling 800 elicited judgments per LM (total n = 2,400). Our results reveal variable above-chance accuracy in the grammatical condition, below-chance accuracy in the ungrammatical condition, significant instability of answers across phenomena, and a yes-response bias for all the tested LMs. Furthermore, we found no evidence that repetition helps the models converge on a processing strategy that yields stable answers, whether accurate or inaccurate. We demonstrate that the LMs' performance in identifying (un)grammatical word patterns stands in stark contrast to what is observed in humans (n = 80, tested on the same tasks) and argue that adopting LMs as theories of human language is not justified at their current stage of development.
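To make the counts in the design concrete (8 phenomena × 2 conditions × 5 sentences × 10 repetitions = 800 judgments per LM), the following is a minimal Python sketch of an elicited-judgment loop under those parameters. The phenomenon labels and counts come from the abstract; the stimuli, the ask_model() stub, and all other identifiers are hypothetical placeholders, not the authors' materials or code.

```python
# Illustrative sketch of the judgment-elicitation design described in the abstract.
# ask_model() is a stand-in for a yes/no acceptability query to an LM.
import random
from collections import defaultdict

PHENOMENA = [
    "plural attraction", "anaphora", "center embedding", "comparatives",
    "intrusive resumption", "negative polarity items",
    "order of adjectives", "order of adverbs",
]
SENTENCES_PER_CONDITION = 5   # 5 grammatical + 5 ungrammatical per phenomenon
REPETITIONS = 10              # each sentence queried 10 times

def ask_model(sentence: str, grammatical: bool) -> bool:
    """Hypothetical placeholder: returns a noisy, 'yes'-biased guess."""
    return random.random() < (0.8 if grammatical else 0.6)

# (phenomenon, condition) -> list of booleans marking correct answers
judgments = defaultdict(list)
for phenomenon in PHENOMENA:
    for grammatical in (True, False):
        for item in range(SENTENCES_PER_CONDITION):
            sentence = f"[{phenomenon} item {item}]"  # stand-in for a real stimulus
            for _ in range(REPETITIONS):
                says_yes = ask_model(sentence, grammatical)
                judgments[(phenomenon, grammatical)].append(says_yes == grammatical)

total = sum(len(v) for v in judgments.values())
assert total == len(PHENOMENA) * 2 * SENTENCES_PER_CONDITION * REPETITIONS == 800

for grammatical in (True, False):
    hits = [c for (p, g), v in judgments.items() if g == grammatical for c in v]
    label = "grammatical" if grammatical else "ungrammatical"
    print(f"{label}: accuracy = {sum(hits) / len(hits):.2f} (n = {len(hits)})")
```

With three LMs queried this way, the grand total is 3 × 800 = 2,400 elicited judgments, matching the n reported in the abstract.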