Schaye Verity, DiTullio David, Guzman Benedict Vincent, Vennemeyer Scott, Shih Hanniel, Reinstein Ilan, Weber Danielle E, Goodman Abbie, Wu Danny T Y, Sartori Daniel J, Santen Sally A, Gruppen Larry, Aphinyanaphongs Yindalon, Burk-Rafel Jesse
Department of Medicine, NYU Grossman School of Medicine, New York, NY, United States.
Institute for Innovations in Medical Education, NYU Grossman School of Medicine, New York, NY, United States.
J Med Internet Res. 2025 Mar 21;27:e67967. doi: 10.2196/67967.
Clinical reasoning (CR) is an essential skill, yet physicians often receive limited feedback on it. Artificial intelligence (AI) holds promise to fill this gap.
We report the development of named entity recognition (NER) logic-based and large language model (LLM)-based assessments of CR documentation in the electronic health record across 2 institutions (NYU Grossman School of Medicine [NYU] and University of Cincinnati College of Medicine [UC]).
The note corpus consisted of internal medicine resident admission notes (retrospective set: July 2020-December 2021, n=700 NYU and 450 UC notes; prospective validation set: July 2023-December 2023, n=155 NYU and 92 UC notes). Clinicians rated CR documentation quality in each note using a previously validated tool (Revised-IDEA) on 3-point scales across 2 domains: differential diagnosis (D0, D1, and D2) and explanation of reasoning (EA0, EA1, and EA2). At NYU, the retrospective set was annotated for NER with 5 entities (diagnosis, diagnostic category, prioritization of diagnosis language, data, and linkage terms). Models were developed using 3 AI approaches: (1) an NER logic-based model, a large word vector model (scispaCy en_core_sci_lg) with weights adjusted via backpropagation from the annotations, developed at NYU with external validation at UC; (2) the NYUTron LLM, an NYU-internal 110 million parameter LLM pretrained on 7.25 million clinical notes, validated only at NYU; and (3) the GatorTron LLM, an open-source 345 million parameter LLM pretrained on 82 billion words of clinical text, fine-tuned on the NYU retrospective set, then externally validated and further fine-tuned at UC. Model performance was assessed in the prospective sets using F-scores for the NER logic-based model and area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) for the LLMs.
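For concreteness, below is a minimal sketch of the LLM fine-tuning step, using the open-source GatorTron checkpoint published on Hugging Face (UFNLP/gatortron-base). The dataset wrapper, example notes, label mapping, and hyperparameters are illustrative assumptions, not the study's actual pipeline.

```python
# Sketch: fine-tuning an open-source clinical encoder for note-level
# Revised-IDEA classification. All data below are hypothetical placeholders.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "UFNLP/gatortron-base"  # open-source 345M-parameter clinical LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=3)  # treating D0/D1/D2 as 3 classes (assumption)

class NoteDataset(Dataset):
    """Pairs admission-note text with clinician Revised-IDEA labels."""
    def __init__(self, notes, labels):
        self.enc = tokenizer(notes, truncation=True,
                             padding="max_length", max_length=512)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# Placeholder examples standing in for the annotated retrospective set.
notes = ["Assessment and plan: likely CHF exacerbation given edema ...",
         "Admit for chest pain. Plan: serial troponins, EKG, monitor."]
labels = [2, 0]  # hypothetical D scores

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gatortron-idea",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=NoteDataset(notes, labels),
)
trainer.train()  # external validation would reuse the resulting checkpoint
```

In this framing, each Revised-IDEA domain score becomes a sequence classification head on the pretrained clinical encoder, so the same recipe applies to the D and EA domains alike.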
At NYU, the NYUTron LLM performed best: the D0 and D2 models had AUROC/AUPRC of 0.87/0.79 and 0.89/0.86, respectively. The D1, EA0, and EA1 models had insufficient performance for implementation (AUROC range 0.57-0.80; AUPRC range 0.33-0.63). For D1 classification, the approach pivoted to a stepwise scheme that leverages the more performant D0 and D2 models. For the EA models, the approach pivoted to a binary EA2 model (ie, EA2 vs not EA2) with excellent performance (AUROC/AUPRC 0.85/0.80). At UC, the NER logic-based model was the best-performing D model (F-scores 0.80, 0.74, and 0.80 for D0, D1, and D2, respectively). The GatorTron LLM performed best for EA2 scores (AUROC/AUPRC 0.75/0.69).
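A hedged sketch of how the stepwise D scheme and the reported metrics could be realized: the 2 reliable binary classifiers are combined by thresholding, with D1 assigned as the residual class (the thresholds and check order are assumptions), and AUROC/AUPRC are computed with scikit-learn on hypothetical labels.

```python
# Stepwise D scoring: apply the stronger binary D2 and D0 classifiers first
# and fall through to D1 only when neither fires. Thresholds are illustrative.
from sklearn.metrics import average_precision_score, roc_auc_score

def stepwise_d_score(p_d0: float, p_d2: float,
                     t_d0: float = 0.5, t_d2: float = 0.5) -> int:
    """Combine binary D0 and D2 model probabilities into a 3-level D score."""
    if p_d2 >= t_d2:   # evidence of a prioritized differential -> D2
        return 2
    if p_d0 >= t_d0:   # evidence of no differential -> D0
        return 0
    return 1           # neither model fires -> intermediate D1

# Evaluating a binary EA2-vs-not-EA2 model with the reported metrics
# (labels and probabilities below are hypothetical placeholders).
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.7, 0.6, 0.4]
print("AUROC:", roc_auc_score(y_true, y_prob))            # receiver operating characteristic
print("AUPRC:", average_precision_score(y_true, y_prob))  # precision-recall
```

Routing decisions through the better-calibrated D0 and D2 models and treating D1 as the residual class avoids depending directly on the weakest classifier.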
This is the first multi-institutional study to apply LLMs for assessing CR documentation in the electronic health record. Such tools can enhance feedback on CR. Lessons learned from implementing these models at 2 distinct institutions support the generalizability of this approach.