Sreedhar Radhika, Chang Linda, Gangopadhyaya Ananya, Shiels Peggy Woziwodzki, Loza Julie, Chi Euna, Gabel Elizabeth, Park Yoon Soo
University of Illinois College of Medicine, Chicago, IL, USA.
J Gen Intern Med. 2025 Jan;40(1):127-134. doi: 10.1007/s11606-024-09050-9. Epub 2024 Oct 14.
The Liaison Committee on Medical Education requires that medical students receive individualized feedback on their self-directed learning skills. Pre-clinical students are asked to complete multiple spaced critical appraisal assignments; however, providing individualized feedback on them requires significant faculty time. As large language models (LLMs) can score work and generate feedback, we explored their use in grading formative assessments through the lenses of validity and feasibility.
To explore the consistency and feasibility of using an LLM to assess and provide feedback for formative assessments in undergraduate medical education.
This was a cross-sectional study of pre-clinical students' critical appraisal assignments at University of Illinois College of Medicine (UICOM) during the 2022-2023 academic year.
An initial sample of ten assignments was used to develop a prompt. For each student entry, the de-identified assignment and prompt were provided to ChatGPT 3.5, and its scoring was compared to the existing faculty grade.
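The abstract does not specify the exact prompt or interface used, so the following is a minimal sketch assuming the OpenAI chat API with gpt-3.5-turbo; the prompt text and the grade_assignment helper are illustrative placeholders, not the study's actual materials.

```python
# Illustrative sketch only: the study's prompt and grading workflow are not
# published, and it may have used the ChatGPT web interface rather than the API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical rubric prompt, developed and refined on an initial sample.
GRADING_PROMPT = (
    "You are grading a pre-clinical medical student's critical appraisal "
    "assignment against the rubric below. Score each rubric item 0 or 1 "
    "and give brief, specific feedback for each item."
)

def grade_assignment(deidentified_text: str) -> str:
    """Send one de-identified assignment plus the rubric prompt to the model."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": GRADING_PROMPT},
            {"role": "user", "content": deidentified_text},
        ],
        temperature=0,  # deterministic decoding for more consistent scoring
    )
    return response.choices[0].message.content
```

The model's item scores can then be parsed from the response and compared against the existing faculty grades.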
Differences in the scoring of individual items between ChatGPT and faculty were assessed. Scoring consistency was measured as inter-rater reliability (IRR), calculated as percent exact agreement. A chi-squared test was used to determine whether scores differed significantly. Psychometric characteristics, including internal-consistency reliability, area under the precision-recall curve (AUCPR), and cost, were also studied.
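As a rough illustration of these measurements, the sketch below computes percent exact agreement, a chi-squared test on the paired item scores, per-item AUCPR (via average precision, treating faculty scores as ground truth), and Cronbach's alpha for internal consistency. The binary score matrices are made-up toy data, not the study's data.

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import average_precision_score

# Toy binary item scores (1 = credit, 0 = no credit); rows are students,
# columns are rubric items. These numbers are invented for illustration.
faculty = np.array([[1, 1, 1, 1],
                    [1, 1, 0, 1],
                    [0, 0, 0, 1],
                    [1, 0, 1, 0],
                    [0, 0, 0, 0],
                    [1, 1, 1, 0]])
chatgpt = np.array([[1, 1, 1, 1],
                    [1, 1, 1, 1],
                    [0, 0, 0, 0],
                    [1, 0, 1, 0],
                    [0, 0, 0, 0],
                    [1, 1, 0, 0]])

# IRR as percent exact agreement over all item-level scores.
agreement = np.mean(faculty == chatgpt)

# Chi-squared test on the 2x2 contingency table of paired item scores.
table = [[np.sum((faculty == 1) & (chatgpt == 1)),
          np.sum((faculty == 1) & (chatgpt == 0))],
         [np.sum((faculty == 0) & (chatgpt == 1)),
          np.sum((faculty == 0) & (chatgpt == 0))]]
chi2, p, _, _ = chi2_contingency(table)

# Per-item AUCPR with the faculty score as ground truth; average precision
# is a standard estimator of the area under the precision-recall curve.
aucpr = [average_precision_score(faculty[:, i], chatgpt[:, i])
         for i in range(faculty.shape[1])]

def cronbach_alpha(scores: np.ndarray) -> float:
    """Internal-consistency reliability across rubric items."""
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

print(f"exact agreement = {agreement:.2f}, chi-squared p = {p:.4f}")
print(f"mean AUCPR = {np.mean(aucpr):.2f} "
      f"(range {min(aucpr):.2f}-{max(aucpr):.2f})")
print(f"Cronbach's alpha (ChatGPT) = {cronbach_alpha(chatgpt):.2f}")
```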
In this cross-sectional study, faculty-graded assignments from 111 pre-clinical students were compared with ChatGPT's grading, and the scoring of individual items was comparable. Overall agreement between ChatGPT and faculty was 67% (OR = 2.53, P < 0.001); mean AUCPR was 0.69 (range 0.61-0.76). ChatGPT's internal-consistency reliability was 0.64, and its use resulted in a fivefold reduction in grading time, with potential savings of 150 faculty hours.
This study of the psychometric characteristics of ChatGPT demonstrates a potential role for LLMs in assisting faculty with assessing formative assignments and providing feedback.