
Challenging large language models' "intelligence" with human tools: A neuropsychological investigation in Italian language on prefrontal functioning.

Authors

Loconte Riccardo, Orrù Graziella, Tribastone Mirco, Pietrini Pietro, Sartori Giuseppe

Affiliations

Molecular Mind Lab, IMT School of Advanced Studies Lucca, Lucca, Italy.

University of Pisa, Pisa, Italy.

Publication

Heliyon. 2024 Oct 3;10(19):e38911. doi: 10.1016/j.heliyon.2024.e38911. eCollection 2024 Oct 15.

Abstract

The Artificial Intelligence (AI) research community has used ad-hoc benchmarks to measure the "intelligence" level of Large Language Models (LLMs). In humans, intelligence is closely linked to the functional integrity of the prefrontal lobes, which are essential for higher-order cognitive processes. Previous research has found that LLMs struggle with cognitive tasks that rely on these prefrontal functions, highlighting a significant challenge in replicating human-like intelligence. In December 2022, OpenAI released ChatGPT, a new chatbot based on the GPT-3.5 model that quickly gained popularity for its impressive ability to understand and respond to human instructions, suggesting a significant step towards intelligent behaviour in AI. Therefore, to rigorously investigate LLMs' level of "intelligence," we evaluated the GPT-3.5 and GPT-4 versions through a neuropsychological assessment using tests in the Italian language routinely employed to assess prefrontal functioning in humans. The same tests were also administered to Claude2 and Llama2 to verify whether similar language models perform similarly in prefrontal tests. When using human performance as a reference, GPT-3.5 showed inhomogeneous results on prefrontal tests, with some tests well above average, others in the lower range, and others frankly impaired. Specifically, we have identified poor planning abilities and difficulty in recognising semantic absurdities and understanding others' intentions and mental states. Claude2 exhibited a similar pattern to GPT-3.5, while Llama2 performed poorly in almost all tests. These inconsistent profiles highlight how LLMs' emergent abilities do not yet mimic human cognitive functioning. The sole exception was GPT-4, which performed within the normative range for all the tasks except planning. Furthermore, we showed how standardised neuropsychological batteries developed to assess human cognitive functions may be suitable for challenging LLMs' performance.


Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1988/11490853/6b0f5aa05d66/gr1.jpg
