


LLM-based automatic short answer grading in undergraduate medical education.

Affiliation

Department of Life Sciences and Medicine, University of Luxembourg, 6, avenue de la Fonte, L-4364, Esch-sur-Alzette, Luxembourg.

Publication

BMC Med Educ. 2024 Sep 27;24(1):1060. doi: 10.1186/s12909-024-06026-5.

DOI: 10.1186/s12909-024-06026-5
PMID: 39334087
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11429088/
Abstract

BACKGROUND

Multiple-choice questions are heavily used in medical education assessments, but they test recognition rather than knowledge recall. Grading open questions, however, is a time-intensive task for teachers. Automatic short answer grading (ASAG) has sought to fill this gap, and with the recent advent of Large Language Models (LLMs), the field has gained new momentum.

METHODS

We graded 2288 student answers from 12 undergraduate medical education courses in 3 languages using GPT-4 and Gemini 1.0 Pro.
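The grading pipeline described above can be sketched as a prompt builder plus a reply parser. This is a minimal illustration, not the paper's actual prompt or code: the prompt wording, the `Grade: X` reply convention, and both function names are assumptions.

```python
import re

def build_grading_prompt(question, answer_key, student_answer, max_points):
    """Assemble a zero-shot grading prompt from the question, the teacher's
    answer key, and the student's answer (illustrative wording only)."""
    return (
        "You are grading a short-answer exam question.\n"
        f"Question: {question}\n"
        f"Answer key: {answer_key}\n"
        f"Student answer: {student_answer}\n"
        f"Assign a grade between 0 and {max_points}. "
        "Reply with the grade on a line of the form 'Grade: X'."
    )

def parse_grade(llm_reply, max_points):
    """Extract the numeric grade from the model's free-text reply and clamp
    it to the valid range; return None if no grade can be parsed."""
    match = re.search(r"Grade:\s*([0-9]+(?:\.[0-9]+)?)", llm_reply)
    if match is None:
        return None  # unparseable reply: fall back to human grading
    return min(max(float(match.group(1)), 0.0), float(max_points))
```

A reply of `"Grade: 1.5"` with `max_points=2` parses to `1.5`; out-of-range values are clamped rather than discarded.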

RESULTS

GPT-4 proposed significantly lower grades than the human evaluator but reached low rates of false positives. The grades of Gemini 1.0 Pro were not significantly different from the teachers'. Both LLMs reached a moderate agreement with human grades, and GPT-4 achieved high precision among answers considered fully correct. Consistent grading behavior could be established for high-quality answer keys. Only a weak correlation was found with respect to the length or language of student answers. There is a risk of bias if the LLM knows the human grade a priori.
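The precision and false-positive figures above can be made concrete by treating "fully correct" (full marks from the human grader) as the positive class. The sketch below uses that framing; the metric definitions are illustrative assumptions, not the paper's exact computation.

```python
def fully_correct_metrics(human_grades, llm_grades, max_points):
    """Compare LLM grades against human grades, with 'awarded full marks'
    as the positive class: returns (precision, false-positive rate)."""
    tp = fp = tn = fn = 0
    for h, l in zip(human_grades, llm_grades):
        human_full = (h == max_points)
        llm_full = (l == max_points)
        if llm_full and human_full:
            tp += 1          # both graders award full marks
        elif llm_full and not human_full:
            fp += 1          # LLM awards full marks, human does not
        elif human_full:
            fn += 1          # human awards full marks, LLM does not
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, fp_rate
```

A stricter-than-human grader (like GPT-4 here) trades recall for precision: it withholds full marks more often, so when it does award them, they are rarely false positives.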

CONCLUSIONS

LLM-based ASAG applied to medical education still requires human oversight, but time can be spared on the edge cases, allowing teachers to focus on the middle ones. For Bachelor-level medical education questions, the training knowledge of LLMs appears sufficient, so fine-tuning is not necessary.


Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9b5/11429088/8964b9af1472/12909_2024_6026_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9b5/11429088/8174871eebbe/12909_2024_6026_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9b5/11429088/87dbecfb6c19/12909_2024_6026_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9b5/11429088/71f5a4648f26/12909_2024_6026_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9b5/11429088/6593a8079d10/12909_2024_6026_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9b5/11429088/262310a79a07/12909_2024_6026_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9b5/11429088/4f148ba9a538/12909_2024_6026_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9b5/11429088/fe93db9c98f0/12909_2024_6026_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9b5/11429088/d0d676e147ff/12909_2024_6026_Fig9_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9b5/11429088/213c9409f393/12909_2024_6026_Fig10_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9b5/11429088/e9fa61dada94/12909_2024_6026_Fig11_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9b5/11429088/50b6e78af247/12909_2024_6026_Fig12_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9b5/11429088/f1134539f902/12909_2024_6026_Fig13_HTML.jpg

Similar Articles

1. LLM-based automatic short answer grading in undergraduate medical education.
   BMC Med Educ. 2024 Sep 27;24(1):1060. doi: 10.1186/s12909-024-06026-5.
2. Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study.
   J Med Internet Res. 2024 Apr 17;26:e56655. doi: 10.2196/56655.
3. Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions.
   Acad Med. 2024 May 1;99(5):508-512. doi: 10.1097/ACM.0000000000005626. Epub 2023 Dec 28.
4. Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study.
   ArXiv. 2024 Jan 23:arXiv:2402.01693v1.
5. A Language Model-Powered Simulated Patient With Automated Feedback for History Taking: Prospective Study.
   JMIR Med Educ. 2024 Aug 16;10:e59213. doi: 10.2196/59213.
6. Evaluating the Capabilities of Generative AI Tools in Understanding Medical Papers: Qualitative Study.
   JMIR Med Inform. 2024 Sep 4;12:e59258. doi: 10.2196/59258.
7. Performance of Large Language Models on Medical Oncology Examination Questions.
   JAMA Netw Open. 2024 Jun 3;7(6):e2417641. doi: 10.1001/jamanetworkopen.2024.17641.
8. Comparing the performance of artificial intelligence learning models to medical students in solving histology and embryology multiple choice questions.
   Ann Anat. 2024 Jun;254:152261. doi: 10.1016/j.aanat.2024.152261. Epub 2024 Mar 21.
9. Large language models and bariatric surgery patient education: a comparative readability analysis of GPT-3.5, GPT-4, Bard, and online institutional resources.
   Surg Endosc. 2024 May;38(5):2522-2532. doi: 10.1007/s00464-024-10720-2. Epub 2024 Mar 12.
10. Appraisal of ChatGPT's Aptitude for Medical Education: Comparative Analysis With Third-Year Medical Students in a Pulmonology Examination.
    JMIR Med Educ. 2024 Jul 23;10:e52818. doi: 10.2196/52818.

Cited By

1. Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders.
   Med Educ Online. 2025 Dec;30(1):2550751. doi: 10.1080/10872981.2025.2550751. Epub 2025 Aug 24.
2. Student perceptions of GenAI as a virtual tutor to support collaborative research training for health professionals.
   BMC Med Educ. 2025 Jul 1;25(1):895. doi: 10.1186/s12909-025-07390-6.
3. GPT-4's capabilities for formative and summative assessments in Norwegian medicine exams - an intrinsic case study in the early phase of intervention.
   Front Med (Lausanne). 2025 Apr 10;12:1441747. doi: 10.3389/fmed.2025.1441747. eCollection 2025.
4. Evaluating the Effectiveness of Large Language Models in Providing Patient Education for Chinese Patients With Ocular Myasthenia Gravis: Mixed Methods Study.
   J Med Internet Res. 2025 Apr 10;27:e67883. doi: 10.2196/67883.

References

1. Smart grading: A generative AI-based tool for knowledge-grounded answer evaluation in educational assessments.
   MethodsX. 2023 Dec 20;12:102531. doi: 10.1016/j.mex.2023.102531. eCollection 2024 Jun.
2. ChatGPT for assessment writing.
   Med Teach. 2023 Nov;45(11):1224-1227. doi: 10.1080/0142159X.2023.2249239. Epub 2023 Oct 16.
3. Medical Teacher's first ChatGPT's referencing hallucinations: Lessons for editors, reviewers, and teachers.
   Med Teach. 2023 Jul;45(7):673-675. doi: 10.1080/0142159X.2023.2208731. Epub 2023 May 15.
4. Twelve tips for introducing very short answer questions (VSAQs) into your medical curriculum.
   Med Teach. 2023 Apr;45(4):360-367. doi: 10.1080/0142159X.2022.2093706. Epub 2022 Jul 14.
5. Uncovering students' misconceptions by assessment of their written questions.
   BMC Med Educ. 2016 Aug 24;16(1):221. doi: 10.1186/s12909-016-0739-5.