Challenging large language models' "intelligence" with human tools: A neuropsychological investigation in Italian language on prefrontal functioning.

Authors

Loconte Riccardo, Orrù Graziella, Tribastone Mirco, Pietrini Pietro, Sartori Giuseppe

Affiliations

Molecular Mind Lab, IMT School of Advanced Studies Lucca, Lucca, Italy.

University of Pisa, Pisa, Italy.

Publication information

Heliyon. 2024 Oct 3;10(19):e38911. doi: 10.1016/j.heliyon.2024.e38911. eCollection 2024 Oct 15.

DOI:10.1016/j.heliyon.2024.e38911
PMID:39430451
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11490853/
Abstract

The Artificial Intelligence (AI) research community has used ad-hoc benchmarks to measure the "intelligence" level of Large Language Models (LLMs). In humans, intelligence is closely linked to the functional integrity of the prefrontal lobes, which are essential for higher-order cognitive processes. Previous research has found that LLMs struggle with cognitive tasks that rely on these prefrontal functions, highlighting a significant challenge in replicating human-like intelligence. In December 2022, OpenAI released ChatGPT, a new chatbot based on the GPT-3.5 model that quickly gained popularity for its impressive ability to understand and respond to human instructions, suggesting a significant step towards intelligent behaviour in AI. Therefore, to rigorously investigate LLMs' level of "intelligence," we evaluated the GPT-3.5 and GPT-4 versions through a neuropsychological assessment using tests in the Italian language routinely employed to assess prefrontal functioning in humans. The same tests were also administered to Claude2 and Llama2 to verify whether similar language models perform similarly in prefrontal tests. When using human performance as a reference, GPT-3.5 showed inhomogeneous results on prefrontal tests, with some tests well above average, others in the lower range, and others frankly impaired. Specifically, we have identified poor planning abilities and difficulty in recognising semantic absurdities and understanding others' intentions and mental states. Claude2 exhibited a similar pattern to GPT-3.5, while Llama2 performed poorly in almost all tests. These inconsistent profiles highlight how LLMs' emergent abilities do not yet mimic human cognitive functioning. The sole exception was GPT-4, which performed within the normative range for all the tasks except planning. Furthermore, we showed how standardised neuropsychological batteries developed to assess human cognitive functions may be suitable for challenging LLMs' performance.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1988/11490853/1380e134e4d3/gr2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1988/11490853/6b0f5aa05d66/gr1.jpg

Similar articles

1. Challenging large language models' "intelligence" with human tools: A neuropsychological investigation in Italian language on prefrontal functioning. Heliyon. 2024 Oct 3;10(19):e38911. doi: 10.1016/j.heliyon.2024.e38911. eCollection 2024 Oct 15.
2. Assessing the Alignment of Large Language Models With Human Values for Mental Health Integration: Cross-Sectional Study Using Schwartz's Theory of Basic Values. JMIR Ment Health. 2024 Apr 9;11:e55988. doi: 10.2196/55988.
3. Evaluating the Capabilities of Generative AI Tools in Understanding Medical Papers: Qualitative Study. JMIR Med Inform. 2024 Sep 4;12:e59258. doi: 10.2196/59258.
4. Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard. JMIR Med Educ. 2024 Feb 21;10:e51523. doi: 10.2196/51523.
5. Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study. J Med Internet Res. 2024 Jun 14;26:e53297. doi: 10.2196/53297.
6. Utility of Large Language Models for Health Care Professionals and Patients in Navigating Hematopoietic Stem Cell Transplantation: Comparison of the Performance of ChatGPT-3.5, ChatGPT-4, and Bard. J Med Internet Res. 2024 May 17;26:e54758. doi: 10.2196/54758.
7. Evaluation of large language models in breast cancer clinical scenarios: a comparative analysis based on ChatGPT-3.5, ChatGPT-4.0, and Claude2. Int J Surg. 2024 Apr 1;110(4):1941-1950. doi: 10.1097/JS9.0000000000001066.
8. Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models. JMIR Med Educ. 2024 Feb 13;10:e51391. doi: 10.2196/51391.
9. Capacity of Generative AI to Interpret Human Emotions From Visual and Textual Data: Pilot Evaluation Study. JMIR Ment Health. 2024 Feb 6;11:e54369. doi: 10.2196/54369.
10. Artificial Intelligence for Anesthesiology Board-Style Examination Questions: Role of Large Language Models. J Cardiothorac Vasc Anesth. 2024 May;38(5):1251-1259. doi: 10.1053/j.jvca.2024.01.032. Epub 2024 Feb 1.

Cited by

1. The presentation of self in the age of ChatGPT. Front Sociol. 2025 Aug 21;10:1614473. doi: 10.3389/fsoc.2025.1614473. eCollection 2025.
2. Evaluating the strengths and weaknesses of large language models in answering neurophysiology questions. Sci Rep. 2024 May 11;14(1):10785. doi: 10.1038/s41598-024-60405-y.
