Age against the machine – susceptibility of large language models to cognitive impairment: cross sectional analysis.

Author information

Dayan Roy, Uliel Benjamin, Koplewitz Gal

Affiliations

Department of Neurology, Hadassah Medical Center, Jerusalem, Israel.

Faculty of Medicine, Hebrew University, Jerusalem, Israel.

Publication information

BMJ. 2024 Dec 19;387:e081948. doi: 10.1136/bmj-2024-081948.

Abstract

OBJECTIVE

To evaluate the cognitive abilities of the leading large language models and identify their susceptibility to cognitive impairment, using the Montreal Cognitive Assessment (MoCA) and additional tests.

DESIGN

Cross sectional analysis.

SETTING

Online interaction with large language models via text based prompts.

PARTICIPANTS

Publicly available large language models, or "chatbots": ChatGPT versions 4 and 4o (developed by OpenAI), Claude 3.5 "Sonnet" (developed by Anthropic), and Gemini versions 1 and 1.5 (developed by Alphabet).

ASSESSMENTS

The MoCA test (version 8.1) was administered to the leading large language models with instructions identical to those given to human patients. Scoring followed official guidelines and was evaluated by a practising neurologist. Additional assessments included the Navon figure, cookie theft picture, Poppelreuter figure, and Stroop test.
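The setting and assessment protocol above can, in principle, be reproduced by sending each test item as a text prompt and having a clinician score the free-text reply. The paper does not state which interface was used, so the sketch below, written against the OpenAI Python client with an illustrative (not official) item wording, only shows the kind of interaction involved; the model name, prompt text, and structure are assumptions, not the authors' protocol.

```python
# Minimal sketch: administering one text-based cognitive item to a chatbot.
# Assumptions: the OpenAI Python client, the "gpt-4o" model name, and the item
# wording are illustrative; the study only specifies text-based prompts with
# instructions identical to those given to human patients.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Illustrative item in the spirit of the MoCA abstraction section
# (the official MoCA 8.1 wording is copyrighted and not reproduced here).
prompt = (
    "I am going to give you a short cognitive task. "
    "Answer as a patient would, without extra commentary.\n\n"
    "In what way are a train and a bicycle alike?"
)

response = client.chat.completions.create(
    model="gpt-4o",  # one of the chatbots assessed in the study
    messages=[{"role": "user", "content": prompt}],
)

# The raw transcript would then be scored by a practising neurologist,
# following the official MoCA guidelines, as described above.
print(response.choices[0].message.content)
```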

MAIN OUTCOME MEASURES

MoCA scores, performance in visuospatial/executive tasks, and Stroop test results.

RESULTS

ChatGPT 4o achieved the highest score on the MoCA test (26/30), followed by ChatGPT 4 and Claude (25/30), with Gemini 1.0 scoring lowest (16/30). All large language models showed poor performance in visuospatial/executive tasks. Gemini models failed at the delayed recall task. Only ChatGPT 4o succeeded in the incongruent stage of the Stroop test.
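The conclusions that follow appear to rest on the conventional MoCA interpretation, under which a total score of 26 or more out of 30 is considered normal and lower totals suggest mild cognitive impairment. That cutoff is standard MoCA practice rather than something restated in this abstract, so the short sketch below, which applies it to the scores reported here, is purely an illustration of the reasoning.

```python
# Illustration: applying the conventional MoCA cutoff (>= 26/30 is considered
# normal; lower totals suggest mild cognitive impairment) to the scores
# reported in the results. Gemini 1.5's total is not given in this abstract
# and is therefore omitted.
MOCA_CUTOFF = 26

scores = {
    "ChatGPT 4o": 26,
    "ChatGPT 4": 25,
    "Claude 3.5 Sonnet": 25,
    "Gemini 1.0": 16,
}

for model, total in scores.items():
    status = "normal range" if total >= MOCA_CUTOFF else "suggests mild cognitive impairment"
    print(f"{model}: {total}/30 -> {status}")
```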

CONCLUSIONS

With the exception of ChatGPT 4o, almost all large language models subjected to the MoCA test showed signs of mild cognitive impairment. Moreover, as in humans, age is a key determinant of cognitive decline: "older" chatbots, like older patients, tend to perform worse on the MoCA test. These findings challenge the assumption that artificial intelligence will soon replace human doctors, as the cognitive impairment evident in leading chatbots may affect their reliability in medical diagnostics and undermine patients' confidence.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ea1c/12128858/e35852375e1d/dayr081948.f1.jpg
