Performance assessment of ChatGPT 4, ChatGPT 3.5, Gemini Advanced Pro 1.5 and Bard 2.0 to problem solving in pathology in French language.

Author information

Tarris Georges, Martin Laurent

Affiliations

Department of Pathology, University Hospital François Mitterrand of Dijon-Bourgogne, Dijon, France.

University of Burgundy Health Sciences Center, Dijon, France.

Publication information

Digit Health. 2025 Jan 31;11:20552076241310630. doi: 10.1177/20552076241310630. eCollection 2025 Jan-Dec.

Abstract

Digital teaching diversifies the means of knowledge assessment, as natural language processing (NLP) makes it possible to answer questions posed by students and teachers.

OBJECTIVE

This study evaluated the performance of ChatGPT, Bard and Gemini on second-year medical studies (DFGSM2) pathology exams administered at the Health Sciences Center of Dijon (France) from 2018 to 2022.

METHODS

From 2018 to 2022, exam scores, discriminating powers and discordance rates were retrieved. Seventy questions (25 first-order single-response questions and 45 second-order multiple-response questions) were submitted in May 2023 to ChatGPT 3.5 and Bard 2.0, and in September 2024 to Gemini 1.5 and ChatGPT 4. The chatbots' and students' average scores were compared, as were the discriminating powers of the questions answered by the chatbots. The percentage of identical student-chatbot answers was computed, and linear regression analysis correlated the chatbots' scores with the students' discordance rates. Chatbot reliability was assessed by submitting the questions in four successive rounds and comparing score variability using Fleiss' kappa and Cohen's kappa.
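To make the reliability and regression analyses concrete, below is a minimal Python sketch of the kind of computation the methods describe. All inputs (answer codes, scores, discordance rates) are hypothetical placeholders, and the library calls (statsmodels' fleiss_kappa, sklearn's cohen_kappa_score, scipy's linregress) are one plausible toolchain, not the authors' actual pipeline.

```python
# Minimal sketch of the described analyses; all data are hypothetical placeholders.
import numpy as np
from scipy.stats import linregress
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)

# Hypothetical categorical answers from one chatbot over four rounds
# (70 questions x 4 rounds), coded as answer-pattern identifiers.
answers = rng.integers(0, 5, size=(70, 4))

# Fleiss' kappa: agreement of the chatbot with itself across the four rounds,
# treating each round as one "rater" of the same 70 items.
counts, _ = aggregate_raters(answers)  # shape: (n_questions, n_categories)
print("Fleiss' kappa:", fleiss_kappa(counts))

# Cohen's kappa: pairwise agreement between two individual rounds.
print("Cohen's kappa (rounds 1 vs 2):",
      cohen_kappa_score(answers[:, 0], answers[:, 1]))

# Linear regression: per-question chatbot score vs student discordance rate
# (both arrays are fabricated here purely for illustration).
discordance = rng.uniform(0, 1, size=70)
chatbot_score = 1 - discordance + rng.normal(0, 0.1, size=70)
fit = linregress(discordance, chatbot_score)
print(f"R^2 = {fit.rvalue**2:.2f}, p = {fit.pvalue:.3g}")
```

Values of kappa near 1 would indicate that a chatbot's answers are stable across repeated rounds, while lower values reflect the prompt-sensitivity noted in the results.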

RESULTS

The newer chatbots outperformed both the students and the older chatbots on overall scores and on multiple-response questions. All chatbots outperformed the students on the less discriminating questions; conversely, the students outperformed all chatbots on questions with high discriminating power. The chatbots' scores were correlated with the students' discordance rates. ChatGPT 4 and Gemini 1.5 gave variable answers across rounds, an effect linked to prompt engineering.

CONCLUSION

In line with the literature, our study confirms the chatbots' moderate performance on questions requiring complex reasoning, with ChatGPT outperforming the Google chatbots. The use of NLP software based on distributional semantics remains a challenge for question generation in French. Drawbacks of NLP software for question generation include hallucinations and erroneous medical knowledge, which must be taken into account when using such software in medical education.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/26a8/11786284/8adb5da18014/10.1177_20552076241310630-fig1.jpg
