A.T. Cianciolo is associate professor of medical education, Southern Illinois University School of Medicine, Springfield, Illinois; ORCID: https://orcid.org/0000-0001-5948-9304.
N. LaVoie is president, Parallel Consulting, Petaluma, California; ORCID: https://orcid.org/0000-0002-7013-3568.
Acad Med. 2021 Jul 1;96(7):1026-1035. doi: 10.1097/ACM.0000000000004010.
Developing medical students' clinical reasoning requires a structured longitudinal curriculum with frequent targeted assessment and feedback. Performance-based assessments, which have the strongest validity evidence, are currently not feasible for this purpose because they are time-intensive to score. This study explored the potential of using machine learning technologies to score one such assessment: the diagnostic justification essay.
From May to September 2018, machine scoring algorithms were trained to score a sample of 700 diagnostic justification essays written by 414 third-year medical students from the Southern Illinois University School of Medicine classes of 2012-2017. The algorithms applied semantically based natural language processing metrics (e.g., coherence, readability) to assess essay quality on 4 criteria (differential diagnosis, recognition and use of findings, workup, and thought process); the scores for these criteria were summed to create overall scores. Three sources of validity evidence (response process, internal structure, and association with other variables) were examined.
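For illustration, the sketch below shows one way semantically based metrics (here, a rough readability estimate and a crude sentence-overlap coherence proxy) could be combined into per-criterion scores and summed to an overall score. The feature definitions, weights, and scoring function are hypothetical assumptions for exposition only, not the study's actual algorithms.

```python
# Hypothetical sketch of semantically based essay scoring; features and weights
# are illustrative assumptions, not the study's trained model.
import re

CRITERIA = ["differential_diagnosis", "findings", "workup", "thought_process"]

def readability(text: str) -> float:
    """Rough Flesch reading-ease estimate (syllables approximated by vowel groups)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

def coherence(text: str) -> float:
    """Crude coherence proxy: mean Jaccard word overlap between adjacent sentences."""
    sents = [set(re.findall(r"[a-z']+", s.lower()))
             for s in re.split(r"[.!?]+", text) if s.strip()]
    pairs = list(zip(sents, sents[1:]))
    if not pairs:
        return 0.0
    return sum(len(a & b) / max(1, len(a | b)) for a, b in pairs) / len(pairs)

def score_essay(text: str, weights: dict[str, tuple[float, float, float]]) -> dict[str, float]:
    """Apply per-criterion linear weights to the two features; sum criteria for an overall score."""
    feats = (readability(text), coherence(text))
    scores = {c: w0 + w1 * feats[0] + w2 * feats[1] for c, (w0, w1, w2) in weights.items()}
    scores["overall"] = sum(scores[c] for c in CRITERIA)
    return scores

# Stand-in weights in place of coefficients that would be learned from faculty-rated essays.
example_weights = {c: (1.0, 0.01, 2.0) for c in CRITERIA}
print(score_essay("Pneumonia is most likely. Fever and crackles support it. "
                  "A chest x-ray will confirm the diagnosis.", example_weights))
```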
Machine scores correlated more strongly with faculty ratings than faculty ratings did with each other (machine: .28-.53, faculty: .13-.33) and were less case-specific. Machine scores and faculty ratings were similarly correlated with medical knowledge, clinical cognition, and prior diagnostic justification. Machine scores were more strongly associated with clinical communication than were faculty ratings (.43 vs .31).
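The validity comparisons above rest on pairwise correlations between machine scores, faculty ratings, and external measures. The following minimal sketch, using made-up toy data, shows the kind of Pearson correlation computation involved; the variable names and values are illustrative assumptions, not study data.

```python
# Toy illustration of correlating machine scores and faculty ratings with an
# external measure (e.g., a clinical communication score); data are fabricated.
from statistics import mean, stdev

def pearson_r(x: list[float], y: list[float]) -> float:
    """Sample Pearson correlation coefficient."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / ((len(x) - 1) * sx * sy)

machine = [12.1, 9.4, 14.0, 11.2, 8.7]   # machine overall scores for five essays
faculty = [11.0, 10.2, 13.5, 10.8, 9.1]  # faculty ratings for the same essays
comm    = [78.0, 70.0, 85.0, 74.0, 69.0] # external clinical communication scores

print("machine vs faculty:      ", round(pearson_r(machine, faculty), 2))
print("machine vs communication:", round(pearson_r(machine, comm), 2))
print("faculty vs communication:", round(pearson_r(faculty, comm), 2))
```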
Machine learning technologies may be useful for assessing medical students' long-form written clinical reasoning. Semantically based machine scoring may capture the communicative aspects of clinical reasoning better than faculty ratings, offering the potential for automated assessment that generalizes to the workplace. These results underscore the potential of machine scoring to capture an aspect of clinical reasoning performance that is difficult to assess with traditional analytic scoring methods. Additional research should investigate machine scoring generalizability and examine its acceptability to trainees and educators.