W.F. Bond is professor, Department of Emergency Medicine, University of Illinois College of Medicine, Peoria, Illinois, and is affiliated with Jump Simulation, an OSF HealthCare and University of Illinois College of Medicine at Peoria Collaboration; ORCID: http://orcid.org/0000-0001-6714-7152.
J. Zhou is a PhD student, Department of Computer Science, University of Illinois, Urbana-Champaign, Champaign, Illinois.
Acad Med. 2023 Nov 1;98(11S):S90-S97. doi: 10.1097/ACM.0000000000005357. Epub 2023 Aug 1.
Scoring postencounter patient notes (PNs) yields significant insights into student performance, but the resource intensity of scoring limits its use. Recent advances in natural language processing (NLP) and machine learning allow application of automated short answer grading (ASAG) for this task. This retrospective study evaluated psychometric characteristics and reliability of an ASAG system for PNs and factors contributing to implementation, including feasibility and case-specific phrase annotation required to tune the system for a new case.
PNs from standardized patient (SP) cases within a graduation competency exam were used to train the ASAG system, which applied a feed-forward neural network algorithm for scoring. Using faculty phrase-level annotation, 10 PNs per case were required to tune the ASAG system. After tuning, ASAG item-level ratings for 20 notes per case were compared with human ratings across ASAG-faculty (4 cases, 80 pairings) and ASAG-nonfaculty (2 cases, 40 pairings) pairings. Psychometric characteristics were examined using item analysis and Cronbach's alpha. Inter-rater reliability (IRR) was examined using kappa.
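As an illustration of the two reliability statistics named above, the following Python sketch computes an unweighted Cohen's kappa for one rater pairing and Cronbach's alpha for a notes-by-items score matrix. The abstract does not state which kappa variant or software was used, so the function names, the unweighted form of kappa, and the toy data are assumptions for illustration only.

```python
from collections import Counter
from statistics import variance


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of item ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items on which both raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal proportions.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1.0 - expected)


def cronbach_alpha(scores):
    """Cronbach's alpha; `scores` holds one row per note, one column per item."""
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two notes")
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


# Hypothetical toy data: 1 = checklist item credited, 0 = not credited.
machine = [1, 1, 0, 0]
faculty = [1, 1, 0, 1]
print(cohen_kappa(machine, faculty))  # 0.5 for this toy pairing

notes = [[1, 1], [0, 0], [1, 1], [0, 0]]
print(cronbach_alpha(notes))  # approximately 1.0: the two items agree perfectly
```

In a study of this kind, kappa would typically be computed per item across the paired notes and then averaged, whereas alpha would be computed once per case over all scored notes.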
ASAG scores demonstrated sufficient variability in differentiating learner PN performance and high IRR between machine and human ratings. Across all items, the ASAG-faculty scoring mean kappa was .83 (SE ± .02). The kappa for the ASAG-nonfaculty pairings was also .83 (SE ± .02). The ASAG scoring demonstrated high item discrimination. Internal consistency reliability at the case level ranged from a Cronbach's alpha of .65 to .77. Faculty time cost to train and supervise nonfaculty raters for 4 cases was approximately $1,856. Faculty cost to tune the ASAG system was approximately $928.
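Item discrimination, reported above, is commonly summarized as a corrected item-total correlation: the correlation between each item's scores and the sum of the remaining items. The abstract does not say which discrimination index the authors used, so the sketch below is one common choice rather than the study's method. The function names and the toy score matrix are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        return 0.0  # correlation is undefined; reported as 0.0 by convention here
    return cov / sqrt(ss_x * ss_y)


def item_discrimination(scores):
    """Corrected item-total correlation for each item (column) of `scores`."""
    k = len(scores[0])
    result = []
    for i in range(k):
        item = [row[i] for row in scores]
        # Exclude the item itself from the total to avoid inflating the value.
        rest = [sum(row) - row[i] for row in scores]
        result.append(pearson(item, rest))
    return result


# Hypothetical toy data: rows are notes, columns are checklist items.
notes = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
print(item_discrimination(notes))
```

Values near 1 indicate that an item separates higher-scoring from lower-scoring notes, while values near 0 indicate that it does not.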
NLP-based automated scoring of PNs demonstrated a high degree of reliability and psychometric confidence for use as learner feedback. The small number of phrase-level annotations required to tune the system to a new case enhances feasibility. ASAG-enabled PN scoring has broad implications for improving feedback in case-based learning contexts in medical education.