Comparing Scoring Consistency of Large Language Models with Faculty for Formative Assessments in Medical Education.

Author Information

Sreedhar Radhika, Chang Linda, Gangopadhyaya Ananya, Shiels Peggy Woziwodzki, Loza Julie, Chi Euna, Gabel Elizabeth, Park Yoon Soo

Affiliations

University of Illinois College of Medicine, Chicago, IL, USA.

Publication Information

J Gen Intern Med. 2025 Jan;40(1):127-134. doi: 10.1007/s11606-024-09050-9. Epub 2024 Oct 14.

Abstract

BACKGROUND

The Liaison Committee on Medical Education requires that medical students receive individualized feedback on their self-directed learning skills. Pre-clinical students are asked to complete multiple spaced critical appraisal assignments. However, providing this individualized feedback requires significant faculty time. As large language models (LLMs) can score and generate feedback, we explored their use in grading formative assessments through the lenses of validity and feasibility.

OBJECTIVE

To explore the consistency and feasibility of using an LLM to assess and provide feedback for formative assessments in undergraduate medical education.

DESIGN AND PARTICIPANTS

This was a cross-sectional study of pre-clinical students' critical appraisal assignments at the University of Illinois College of Medicine (UICOM) during the 2022-2023 academic year.

INTERVENTION

An initial sample of ten assignments was used to develop a prompt. For each student entry, the de-identified assignment and prompt were provided to ChatGPT 3.5, and its scoring was compared to the existing faculty grade.
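As a rough illustration of this kind of grading pipeline, a minimal Python sketch follows, assuming the official OpenAI client library; the rubric wording, the score_assignment helper, and the JSON output format are hypothetical and are not the study's actual prompt.

```python
import json

from openai import OpenAI  # assumes the official openai Python client (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical rubric prompt -- the study's actual prompt is not published here.
RUBRIC_PROMPT = (
    "You are grading a medical student's critical appraisal assignment. "
    "Score each rubric item 0 (not met) or 1 (met) and return JSON such as "
    '{"item_scores": [1, 0, 1], "feedback": "..."}.'
)

def score_assignment(assignment_text: str) -> dict:
    """Send one de-identified assignment plus the rubric prompt to the model."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # the GPT-3.5 family behind ChatGPT 3.5
        temperature=0,           # keep scoring as repeatable as possible
        messages=[
            {"role": "system", "content": RUBRIC_PROMPT},
            {"role": "user", "content": assignment_text},
        ],
    )
    return json.loads(response.choices[0].message.content)
```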

MAIN MEASURES

Differences between ChatGPT and faculty in the scoring of individual items were assessed. Scoring consistency was measured as inter-rater reliability (IRR), calculated as percent exact agreement. A chi-squared test was used to determine whether scores differed significantly. Psychometric characteristics, including internal-consistency reliability, area under the precision-recall curve (AUCPR), and cost, were studied.
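The statistics named here are standard; below is a minimal sketch of how they could be computed, assuming binary (0/1) item scores held in NumPy arrays. The helper names are illustrative, not taken from the study.

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import average_precision_score

def percent_exact_agreement(faculty: np.ndarray, llm: np.ndarray) -> float:
    """IRR as the proportion of items the two raters scored identically."""
    return float(np.mean(faculty == llm))

def score_difference_test(faculty: np.ndarray, llm: np.ndarray):
    """Chi-squared test on the 2x2 rater-by-score frequency table."""
    table = np.array([
        [np.sum(faculty == 0), np.sum(faculty == 1)],
        [np.sum(llm == 0), np.sum(llm == 1)],
    ])
    chi2, p, _, _ = chi2_contingency(table)
    return chi2, p

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Internal-consistency reliability over an (examinees x items) matrix."""
    k = item_scores.shape[1]
    item_variance = item_scores.var(axis=0, ddof=1).sum()
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variance / total_variance)

# Example: treating faculty scores as ground truth and LLM scores as
# predictions, AUCPR for one item is average_precision_score(faculty, llm).
```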

KEY RESULTS

In this cross-sectional study, faculty-graded assignments from 111 pre-clinical students were compared with ChatGPT's grading, and the scoring of individual items was comparable. The overall agreement between ChatGPT and faculty was 67% (OR = 2.53, P < 0.001); mean AUCPR was 0.69 (range 0.61-0.76). Internal-consistency reliability of ChatGPT scoring was 0.64, and its use resulted in a fivefold reduction in faculty time, a potential saving of 150 faculty hours.

CONCLUSIONS

This study of the psychometric characteristics of ChatGPT demonstrates a potential role for LLMs in assisting faculty with assessing and providing feedback on formative assignments.
