Comparing Scoring Consistency of Large Language Models with Faculty for Formative Assessments in Medical Education.

Author Information

Sreedhar Radhika, Chang Linda, Gangopadhyaya Ananya, Shiels Peggy Woziwodzki, Loza Julie, Chi Euna, Gabel Elizabeth, Park Yoon Soo

Affiliations

University of Illinois College of Medicine, Chicago, IL, USA.

Publication Information

J Gen Intern Med. 2025 Jan;40(1):127-134. doi: 10.1007/s11606-024-09050-9. Epub 2024 Oct 14.

Abstract

BACKGROUND

The Liaison Committee on Medical Education requires that medical students receive individualized feedback on their self-directed learning skills. Pre-clinical students are asked to complete multiple spaced critical appraisal assignments. However, providing this individualized feedback requires significant faculty time. As large language models (LLMs) can score and generate feedback, we explored their use in grading formative assessments through the lenses of validity and feasibility.

OBJECTIVE

To explore the consistency and feasibility of using an LLM to assess and provide feedback for formative assessments in undergraduate medical education.

DESIGN AND PARTICIPANTS

This was a cross-sectional study of pre-clinical students' critical appraisal assignments at the University of Illinois College of Medicine (UICOM) during the 2022-2023 academic year.

INTERVENTION

An initial sample of ten assignments was used to develop a prompt. For each student entry, the de-identified assignment and prompt were provided to ChatGPT 3.5, and its scoring was compared to the existing faculty grade.
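As a rough illustration of this kind of grading pipeline, a minimal Python sketch follows, assuming the official OpenAI client library; the rubric wording, the score_assignment helper, and the JSON output format are hypothetical and are not the study's actual prompt.

```python
import json

from openai import OpenAI  # assumes the official openai Python client (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical rubric prompt -- the study's actual prompt is not published here.
RUBRIC_PROMPT = (
    "You are grading a medical student's critical appraisal assignment. "
    "Score each rubric item 0 (not met) or 1 (met) and return JSON such as "
    '{"item_scores": [1, 0, 1], "feedback": "..."}.'
)

def score_assignment(assignment_text: str) -> dict:
    """Send one de-identified assignment plus the rubric prompt to the model."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # the GPT-3.5 family behind ChatGPT 3.5
        temperature=0,           # keep scoring as repeatable as possible
        messages=[
            {"role": "system", "content": RUBRIC_PROMPT},
            {"role": "user", "content": assignment_text},
        ],
    )
    return json.loads(response.choices[0].message.content)
```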

MAIN MEASURES

Differences between ChatGPT and faculty in the scoring of individual items were assessed. Scoring consistency was measured as inter-rater reliability (IRR), calculated as percent exact agreement. A chi-squared test was used to determine whether scores differed significantly. Psychometric characteristics, including internal-consistency reliability, area under the precision-recall curve (AUCPR), and cost, were studied.
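The statistics named here are standard; below is a minimal sketch of how they could be computed, assuming binary (0/1) item scores held in NumPy arrays. The helper names are illustrative, not taken from the study.

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import average_precision_score

def percent_exact_agreement(faculty: np.ndarray, llm: np.ndarray) -> float:
    """IRR as the proportion of items the two raters scored identically."""
    return float(np.mean(faculty == llm))

def score_difference_test(faculty: np.ndarray, llm: np.ndarray):
    """Chi-squared test on the 2x2 rater-by-score frequency table."""
    table = np.array([
        [np.sum(faculty == 0), np.sum(faculty == 1)],
        [np.sum(llm == 0), np.sum(llm == 1)],
    ])
    chi2, p, _, _ = chi2_contingency(table)
    return chi2, p

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Internal-consistency reliability over an (examinees x items) matrix."""
    k = item_scores.shape[1]
    item_variance = item_scores.var(axis=0, ddof=1).sum()
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variance / total_variance)

# Example: treating faculty scores as ground truth and LLM scores as
# predictions, AUCPR for one item is average_precision_score(faculty, llm).
```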

KEY RESULTS

In this cross-sectional study, faculty-graded assignments from 111 pre-clinical students were compared with ChatGPT's grading, and the scoring of individual items was comparable. The overall agreement between ChatGPT and faculty was 67% (OR = 2.53, P < 0.001); mean AUCPR was 0.69 (range 0.61-0.76). Internal-consistency reliability of ChatGPT scoring was 0.64, and its use resulted in a fivefold reduction in faculty time, a potential saving of 150 faculty hours.

CONCLUSIONS

This study of the psychometric characteristics of ChatGPT demonstrates a potential role for LLMs in assisting faculty with assessing and providing feedback on formative assignments.
