

Evaluating cognitive performance: Traditional methods vs. ChatGPT.

Author Information

Fei Xiao, Tang Ying, Zhang Jianan, Zhou Zhongkai, Yamamoto Ikuo, Zhang Yi

Affiliations

Department of Rehabilitation Medicine, The First People's Hospital of Changzhou, Changzhou, China.

College of Information Science and Engineering, Hohai University, Changzhou, China.

Publication Information

Digit Health. 2024 Aug 16;10:20552076241264639. doi: 10.1177/20552076241264639. eCollection 2024 Jan-Dec.

Abstract

BACKGROUND

NLP models like ChatGPT promise to revolutionize text-based content delivery, particularly in medicine. Yet, doubts remain about ChatGPT's ability to reliably support evaluations of cognitive performance, warranting further investigation into its accuracy and comprehensiveness in this area.

METHOD

A cohort of 60 cognitively normal individuals and 30 stroke survivors underwent a comprehensive evaluation, covering memory, numerical processing, verbal fluency, and abstract thinking. Healthcare professionals and NLP models GPT-3.5 and GPT-4 conducted evaluations following established standards. Scores were compared, and efforts were made to refine scoring protocols and interaction methods to enhance ChatGPT's potential in these evaluations.

RESULT

Within the cohort of healthy participants, GPT-3.5 showed significant disparities in memory evaluation compared to both physician-led assessments and those conducted using GPT-4 (p < 0.001). Within the memory domain, GPT-3.5 diverged from physician assessments on 8 of 21 specific measures (p < 0.05), and it also deviated significantly from physician assessments in speech evaluation (p = 0.009). Among participants with a history of stroke, GPT-3.5 differed from physician-led evaluations only in verbal assessment (p = 0.002). Notably, optimized scoring methodologies and refined interaction protocols partially mitigated these disparities.
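The abstract reports p-values for differences between physician-led and GPT-3.5 scoring but does not name the statistical test used. As a minimal illustration of this kind of paired rater comparison, the sketch below runs a two-sided sign test on fabricated example scores; the data, the 10-participant sample, and the choice of test are all assumptions for illustration only.

```python
# Hypothetical sketch: paired comparison of two raters' scores for the
# same participants, via a two-sided sign test (the study's actual test
# is not specified in the abstract; all scores below are fabricated).
from math import comb

# Illustrative memory-domain scores for 10 participants, rated by a
# physician and by GPT-3.5 on the same scale.
physician = [8, 7, 9, 6, 8, 7, 9, 8, 6, 7]
gpt35     = [6, 6, 7, 5, 7, 6, 7, 7, 5, 6]

# Keep only non-tied pairs; count how often the physician scored higher.
diffs = [a - b for a, b in zip(physician, gpt35) if a != b]
n, pos = len(diffs), sum(d > 0 for d in diffs)

# Two-sided sign test: under the null, each rater is equally likely to
# score higher, so the count of positive differences is Binomial(n, 0.5).
extreme = max(pos, n - pos)
p = min(1.0, 2 * sum(comb(n, k) for k in range(extreme, n + 1)) / 2**n)
print(f"n = {n}, positive differences = {pos}, p = {p:.4f}")
```

With every difference in the same direction, the sign test rejects the null at the 0.05 level, mirroring the kind of rater disagreement the results describe; a Wilcoxon signed-rank or paired t-test would be the more powerful conventional choices when the magnitudes of the differences are meaningful.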

CONCLUSION

ChatGPT can produce evaluation outcomes comparable to traditional methods. Despite differences from physician evaluations, refinement of scoring algorithms and interaction protocols has improved alignment. ChatGPT performs well even in populations with specific conditions like stroke, suggesting its versatility. GPT-4 yields results closer to physician ratings, indicating potential for further enhancement. These findings highlight ChatGPT's importance as a supplementary tool, offering new avenues for information gathering in medical fields and guiding its ongoing development and application.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a273/11329975/8702c67d691c/10.1177_20552076241264639-fig1.jpg
