Automated Patient Note Grading: Examining Scoring Reliability and Feasibility.

Author information

W.F. Bond is professor, Department of Emergency Medicine, University of Illinois College of Medicine, Peoria, Illinois, and is affiliated with Jump Simulation, an OSF HealthCare and University of Illinois College of Medicine at Peoria Collaboration; ORCID: http://orcid.org/0000-0001-6714-7152.

J. Zhou is a PhD student, Department of Computer Science, University of Illinois, Urbana-Champaign, Champaign, Illinois.

Publication information

Acad Med. 2023 Nov 1;98(11S):S90-S97. doi: 10.1097/ACM.0000000000005357. Epub 2023 Aug 1.

DOI: 10.1097/ACM.0000000000005357
PMID: 37983401
Abstract

PURPOSE

Scoring postencounter patient notes (PNs) yields significant insights into student performance, but the resource intensity of scoring limits its use. Recent advances in natural language processing (NLP) and machine learning allow application of automated short answer grading (ASAG) for this task. This retrospective study evaluated psychometric characteristics and reliability of an ASAG system for PNs and factors contributing to implementation, including feasibility and case-specific phrase annotation required to tune the system for a new case.

METHOD

PNs from standardized patient (SP) cases within a graduation competency exam were used to train the ASAG system, applying a feed-forward neural networks algorithm for scoring. Using faculty phrase-level annotation, 10 PNs per case were required to tune the ASAG system. After tuning, ASAG item-level ratings for 20 notes were compared across ASAG-faculty (4 cases, 80 pairings) and ASAG-nonfaculty (2 cases, 40 pairings). Psychometric characteristics were examined using item analysis and Cronbach's alpha. Inter-rater reliability (IRR) was examined using kappa.
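The two reliability statistics named above can be sketched in a few lines. This is a minimal illustration, not the study's analysis code: it assumes unweighted Cohen's kappa for a pair of raters (the abstract does not say which kappa variant was used), and the function names `cohen_kappa` and `cronbach_alpha` are hypothetical.

```python
from collections import Counter
from statistics import pvariance


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and the same length")
    n = len(rater_a)
    # Observed agreement: share of items on which the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal category rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


def cronbach_alpha(scores):
    """Cronbach's alpha; scores[i][j] is learner i's score on item j."""
    k = len(scores[0])
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)


# Made-up item-level ratings (1 = credit given, 0 = no credit).
machine = [1, 1, 0, 0]
faculty = [1, 1, 0, 1]
print(cohen_kappa(machine, faculty))
```

Population variance is used for both the item and total variances in `cronbach_alpha`; using sample variance for both gives the same alpha, because the correction factor cancels in the ratio.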

RESULTS

ASAG scores demonstrated sufficient variability in differentiating learner PN performance and high IRR between machine and human ratings. Across all items the ASAG-faculty scoring mean kappa was .83 (SE ± .02). The ASAG-nonfaculty pairings kappa was .83 (SE ± .02). The ASAG scoring demonstrated high item discrimination. Internal consistency reliability values at the case level ranged from a Cronbach's alpha of .65 to .77. Faculty time cost to train and supervise nonfaculty raters for 4 cases was approximately $1,856. Faculty cost to tune the ASAG system was approximately $928.
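Item discrimination can be estimated in several ways, and the abstract does not state which index was used. As one common option, the sketch below assumes a corrected item-total correlation: each item is correlated with the sum of the remaining items. The function names and the data are hypothetical.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation; returns NaN if either variable is constant."""
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def item_discrimination(scores):
    """Corrected item-total correlation for each item.

    scores[i][j] is learner i's score on item j. Each item is correlated
    with the total of the other items, so it does not inflate its own value.
    """
    k = len(scores[0])
    result = []
    for j in range(k):
        item = [row[j] for row in scores]
        rest = [sum(row) - row[j] for row in scores]
        result.append(pearson(item, rest))
    return result


# Made-up scores for four learners on three items.
scores = [[1, 1, 1], [1, 1, 0], [0, 0, 1], [0, 0, 0]]
print(item_discrimination(scores))
```

Values near zero flag items that do not separate stronger from weaker notes; higher positive values indicate better discrimination.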

CONCLUSIONS

NLP-based automated scoring of PNs demonstrated a high degree of reliability and psychometric confidence for use as learner feedback. The small number of phrase-level annotations required to tune the system to a new case enhances feasibility. ASAG-enabled PN scoring has broad implications for improving feedback in case-based learning contexts in medical education.

Similar articles

1
Automated Patient Note Grading: Examining Scoring Reliability and Feasibility.
Acad Med. 2023 Nov 1;98(11S):S90-S97. doi: 10.1097/ACM.0000000000005357. Epub 2023 Aug 1.
2
Can Nonclinician Raters Be Trained to Assess Clinical Reasoning in Postencounter Patient Notes?
Acad Med. 2019 Nov;94(11S Association of American Medical Colleges Learn Serve Lead: Proceedings of the 58th Annual Research in Medical Education Sessions):S21-S27. doi: 10.1097/ACM.0000000000002904.
3
Inter-rater reliability and generalizability of patient note scores using a scoring rubric based on the USMLE Step-2 CS format.
Adv Health Sci Educ Theory Pract. 2016 Oct;21(4):761-73. doi: 10.1007/s10459-015-9664-3. Epub 2016 Jan 12.
4
Validity Evidence and Scoring Guidelines for Standardized Patient Encounters and Patient Notes From a Multisite Study of Clinical Performance Examinations in Seven Medical Schools.
Acad Med. 2017 Nov;92(11S Association of American Medical Colleges Learn Serve Lead: Proceedings of the 56th Annual Research in Medical Education Sessions):S12-S20. doi: 10.1097/ACM.0000000000001918.
5
Optimizing Clinical Reasoning Assessments With Analytic and Holistic Ratings: Examining the Validity, Reliability, and Cost of a Simplified Patient Note Scoring Procedure.
Acad Med. 2022 Nov 1;97(11S):S15-S21. doi: 10.1097/ACM.0000000000004908. Epub 2022 Aug 9.
6
A Multi-institutional Study of the Feasibility and Reliability of the Implementation of Constructed Response Exam Questions.
Teach Learn Med. 2023 Oct-Dec;35(5):609-622. doi: 10.1080/10401334.2022.2111571. Epub 2022 Aug 20.
7
High-fidelity patient simulation: validation of performance checklists.
Br J Anaesth. 2004 Mar;92(3):388-92. doi: 10.1093/bja/aeh081. Epub 2004 Jan 22.
8
Comparing Scoring Consistency of Large Language Models with Faculty for Formative Assessments in Medical Education.
J Gen Intern Med. 2025 Jan;40(1):127-134. doi: 10.1007/s11606-024-09050-9. Epub 2024 Oct 14.
9
Validity evidence for a patient note scoring rubric based on the new patient note format of the United States Medical Licensing Examination.
Acad Med. 2013 Oct;88(10):1552-7. doi: 10.1097/ACM.0b013e3182a34b1e.
10
Differential Weighting for Subcomponent Measures of Integrated Clinical Encounter Scores Based on the USMLE Step 2 CS Examination: Effects on Composite Score Reliability and Pass-Fail Decisions.
Acad Med. 2016 Nov;91(11 Association of American Medical Colleges Learn Serve Lead: Proceedings of the 55th Annual Research in Medical Education Sessions):S24-S30. doi: 10.1097/ACM.0000000000001359.

Cited by

1
Virtual Patients Using Large Language Models: Scalable, Contextualized Simulation of Clinician-Patient Dialogue With Feedback.
J Med Internet Res. 2025 Apr 4;27:e68486. doi: 10.2196/68486.
2
Large language models for generating medical examinations: systematic review.
BMC Med Educ. 2024 Mar 29;24(1):354. doi: 10.1186/s12909-024-05239-y.