Tarris Georges, Martin Laurent
Department of Pathology, University Hospital François Mitterrand of Dijon-Bourgogne, Dijon, France.
University of Burgundy Health Sciences Center, Dijon, France.
Digit Health. 2025 Jan 31;11:20552076241310630. doi: 10.1177/20552076241310630. eCollection 2025 Jan-Dec.
Digital teaching diversifies the ways in which knowledge can be assessed, as natural language processing offers the possibility of answering questions posed by students and teachers.
This study evaluated the performance of ChatGPT, Bard, and Gemini on second-year medical studies (DFGSM2) Pathology examinations administered at the Health Sciences Center of Dijon (France) from 2018 to 2022.
Exam scores, discriminating powers, and discordance rates were retrieved for the 2018-2022 sessions. Seventy questions (25 first-order single-response questions and 45 second-order multiple-response questions) were submitted in May 2023 to ChatGPT 3.5 and Bard 2.0, and in September 2024 to Gemini 1.5 and ChatGPT-4. Average scores of chatbots and students were compared, as were the discriminating powers of the questions answered by the chatbots. The percentage of identical student-chatbot answers was retrieved, and linear regression analysis correlated chatbot scores with student discordance rates. Chatbot reliability was assessed by submitting the questions in four successive rounds and comparing score variability using Fleiss' kappa and Cohen's kappa.
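To make the statistical workflow concrete, the sketch below illustrates, in Python, how a chatbot-score-versus-discordance regression and inter-round agreement statistics (Cohen's and Fleiss' kappa) of the kind described above might be computed. This is not the authors' code; all input arrays are hypothetical placeholders, and common open-source routines (scipy, scikit-learn, statsmodels) stand in for whatever tooling was actually used.

```python
# Hypothetical sketch of the analysis described above (not the authors' code).
# Inputs are illustrative: one row per exam question.
import numpy as np
from scipy.stats import linregress
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical per-question data.
student_discordance = np.array([0.10, 0.35, 0.55, 0.20, 0.70, 0.45])
chatbot_score = np.array([1.0, 0.8, 0.4, 1.0, 0.2, 0.6])

# Linear regression: chatbot scores vs. student discordance rates.
fit = linregress(student_discordance, chatbot_score)
print(f"slope={fit.slope:.2f}, r={fit.rvalue:.2f}, p={fit.pvalue:.3f}")

# Reliability across four submission rounds: each row is one question,
# each column the (categorical) answer given in one round -- hypothetical.
rounds = np.array([
    ["A", "A", "A", "A"],
    ["B", "B", "C", "B"],
    ["A", "C", "A", "A"],
    ["D", "D", "D", "D"],
    ["B", "B", "B", "C"],
    ["A", "A", "A", "A"],
])

# Cohen's kappa: pairwise agreement between two rounds.
print("Cohen's kappa (round 1 vs 2):",
      cohen_kappa_score(rounds[:, 0], rounds[:, 1]))

# Fleiss' kappa: agreement across all four rounds, treating rounds as raters.
table, _ = aggregate_raters(rounds)
print("Fleiss' kappa (4 rounds):", fleiss_kappa(table, method="fleiss"))
```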
The newer chatbots outperformed both students and the older chatbots in overall scores and on multiple-response questions. All chatbots outperformed students on the less discriminating questions; conversely, students outperformed all chatbots on questions with high discriminating power. Chatbot scores correlated with student discordance rates. ChatGPT-4 and Gemini 1.5 provided variable answers across rounds, owing to effects linked to prompt engineering.
In line with the literature, our study confirms chatbots' moderate performance on questions requiring complex reasoning, with ChatGPT outperforming the Google chatbots. The use of NLP software based on distributional semantics remains a challenge for the generation of questions in French. Drawbacks to using NLP software for question generation include hallucinations and erroneous medical knowledge, which must be taken into account when NLP software is used in medical education.