Lampinen Andrew K, Dasgupta Ishita, Chan Stephanie C Y, Sheahan Hannah R, Creswell Antonia, Kumaran Dharshan, McClelland James L, Hill Felix
Google DeepMind, Mountain View, CA, 94043 USA.
Google DeepMind, London N1C 4DN, UK.
PNAS Nexus. 2024 Jul 16;3(7):pgae233. doi: 10.1093/pnasnexus/pgae233. eCollection 2024 Jul.
Reasoning is a key ability for an intelligent system. Large language models (LMs) achieve above-chance performance on abstract reasoning tasks but exhibit many imperfections. However, human abstract reasoning is also imperfect. Human reasoning is affected by our real-world knowledge and beliefs, and shows notable "content effects": humans reason more reliably when the semantic content of a problem supports the correct logical inferences. These content-entangled reasoning patterns are central to debates about the fundamental nature of human intelligence. Here, we investigate whether language models, whose prior expectations capture some aspects of human knowledge, similarly mix content into their answers to logic problems. We explored this question across three logical reasoning tasks: natural language inference, judging the logical validity of syllogisms, and the Wason selection task. We evaluate state-of-the-art LMs, as well as humans, and find that the LMs reflect many of the same qualitative human patterns on these tasks: like humans, models answer more accurately when the semantic content of a task supports the logical inferences. These parallels are reflected in accuracy patterns and in some lower-level features, such as the relationship between LM confidence over possible answers and human response times. However, in some cases humans and models behave differently, particularly on the Wason task, where humans perform much worse than large models and exhibit a distinct error pattern. Our findings have implications for understanding possible contributors to these human cognitive effects, as well as the factors that influence language model performance.