Loconte Riccardo, Orrù Graziella, Tribastone Mirco, Pietrini Pietro, Sartori Giuseppe
Molecular Mind Lab, IMT School of Advanced Studies Lucca, Lucca, Italy.
University of Pisa, Pisa, Italy.
Heliyon. 2024 Oct 3;10(19):e38911. doi: 10.1016/j.heliyon.2024.e38911. eCollection 2024 Oct 15.
The Artificial Intelligence (AI) research community has used ad hoc benchmarks to measure the "intelligence" level of Large Language Models (LLMs). In humans, intelligence is closely linked to the functional integrity of the prefrontal lobes, which are essential for higher-order cognitive processes. Previous research has found that LLMs struggle with cognitive tasks that rely on these prefrontal functions, highlighting a significant challenge in replicating human-like intelligence. In December 2022, OpenAI released ChatGPT, a new chatbot based on the GPT-3.5 model that quickly gained popularity for its impressive ability to understand and respond to human instructions, suggesting a significant step towards intelligent behaviour in AI. Therefore, to rigorously investigate LLMs' level of "intelligence," we evaluated the GPT-3.5 and GPT-4 versions through a neuropsychological assessment using tests in the Italian language routinely employed to assess prefrontal functioning in humans. The same tests were also administered to Claude2 and Llama2 to verify whether similar language models perform similarly on prefrontal tests. When using human performance as a reference, GPT-3.5 showed inhomogeneous results on prefrontal tests, scoring well above average on some tests, in the lower range on others, and frankly impaired on still others. Specifically, we identified poor planning abilities and difficulty in recognising semantic absurdities and in understanding others' intentions and mental states. Claude2 exhibited a pattern similar to GPT-3.5, while Llama2 performed poorly on almost all tests. These inconsistent profiles highlight how LLMs' emergent abilities do not yet mimic human cognitive functioning. The sole exception was GPT-4, which performed within the normative range on all tasks except planning. Furthermore, we showed how standardised neuropsychological batteries developed to assess human cognitive functions may be suitable for challenging LLMs' performance.