LLM-based automatic short answer grading in undergraduate medical education.

Affiliation

Department of Life Sciences and Medicine, University of Luxembourg, 6, avenue de la Fonte, L-4364, Esch-sur-Alzette, Luxembourg.

Publication information

BMC Med Educ. 2024 Sep 27;24(1):1060. doi: 10.1186/s12909-024-06026-5.

Abstract

BACKGROUND

Multiple choice questions are heavily used in medical education assessments, but they rely on recognition rather than knowledge recall. Grading open questions, however, is a time-intensive task for teachers. Automatic short answer grading (ASAG) has tried to fill this gap, and with the recent advent of Large Language Models (LLMs), this field has gained new momentum.

METHODS

We graded 2288 student answers from 12 undergraduate medical education courses in 3 languages using GPT-4 and Gemini 1.0 Pro.

RESULTS

GPT-4 proposed significantly lower grades than the human evaluator, but produced few false positives. The grades from Gemini 1.0 Pro did not differ significantly from the teachers'. Both LLMs reached moderate agreement with human grades, and GPT-4 showed high precision among answers considered fully correct. Consistent grading behavior could be determined for high-quality answer keys. Only a weak correlation was found with respect to the length or language of student answers. There is a risk of bias if the LLM knows the human grade a priori.
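As an illustration of what "agreement with human grades" and "precision among answers considered fully correct" can mean in practice, here is a minimal sketch. It assumes unweighted Cohen's kappa as the agreement statistic and a three-level grading scale (0, 0.5, 1); the abstract names neither, so both are assumptions. The function names and the toy grades are hypothetical and are not data from the study.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length sequences of grades."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed agreement: share of answers with identical grades.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each grader's marginal grade frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used a single identical grade throughout
    return (observed - expected) / (1 - expected)


def precision_fully_correct(human, llm, full=1.0):
    """Share of answers the LLM grades as fully correct that the human
    grader also grades as fully correct."""
    flagged = [h for h, m in zip(human, llm) if m == full]
    if not flagged:
        return float("nan")
    return sum(h == full for h in flagged) / len(flagged)


# Made-up toy grades for eight answers (assumption, not study data).
human = [1.0, 1.0, 0.5, 0.0, 1.0, 0.5, 0.0, 1.0]
llm = [1.0, 0.5, 0.5, 0.0, 1.0, 0.0, 0.0, 0.5]

if __name__ == "__main__":
    print("kappa:", round(cohen_kappa(human, llm), 3))
    print("precision (fully correct):", precision_fully_correct(human, llm))
```

In this toy data the LLM grades lower than the human on average yet never awards full marks where the human did not, which mirrors the pattern the abstract reports for GPT-4: stricter grading combined with few false positives.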

CONCLUSIONS

LLM-based ASAG applied to medical education still requires human oversight, but time can be saved on the edge cases, allowing teachers to focus on the middle ones. For Bachelor-level medical education questions, the training knowledge of LLMs seems to be sufficient; fine-tuning is thus not necessary.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9b5/11429088/8964b9af1472/12909_2024_6026_Fig1_HTML.jpg
