
Assessing Completeness of Clinical Histories Accompanying Imaging Orders Using Adapted Open-Source and Closed-Source Large Language Models.

Author Information

Larson David B, Koirala Arogya, Cheuy Lina Y, Paschali Magdalini, Van Veen Dave, Na Hye Sun, Petterson Matthew B, Fang Zhongnan, Chaudhari Akshay S

Institution Information

Department of Radiology, Stanford University School of Medicine, 453 Quarry Rd, MC 5659, Stanford, CA 94304.

AI Development and Evaluation Laboratory, Stanford University School of Medicine, Palo Alto, Calif.

Publication Information

Radiology. 2025 Feb;314(2):e241051. doi: 10.1148/radiol.241051.

Abstract

Background: Incomplete clinical histories are a well-known problem in radiology. Previous dedicated quality improvement efforts focusing on reproducible assessments of the completeness of free-text clinical histories have relied on tedious manual analysis.

Purpose: To adapt and evaluate open-source and closed-source large language models (LLMs) for their ability to automatically extract clinical history elements within imaging orders, and to use the best-performing adapted open-source model to assess the completeness of a large sample of clinical histories as a benchmark for clinical practice.

Materials and Methods: This retrospective single-site study used previously extracted information accompanying CT, MRI, US, and radiography orders from August 2020 to May 2022 at an adult and pediatric emergency department of a 613-bed tertiary academic medical center. Two open-source (Llama 2-7B [Meta], Mistral-7B [Mistral AI]) and one closed-source (GPT-4 Turbo [OpenAI]) LLMs were adapted using prompt engineering, in-context learning, and fine-tuning (open-source only) to extract the elements "past medical history," "what," "when," "where," and "clinical concern" from clinical histories. Model performance, interreader agreement using Cohen κ (none to slight, 0.01-0.20; fair, 0.21-0.40; moderate, 0.41-0.60; substantial, 0.61-0.80; almost perfect, 0.81-1.00), and semantic similarity between the models and the adjudicated manual annotations of two board-certified radiologists with 16 and 3 years of postfellowship experience, respectively, were assessed using accuracy, Cohen κ, and BERTScore, an LLM-based metric that quantifies how well two pieces of text convey the same meaning; 95% CIs were also calculated. The best-performing open-source model was then used to assess completeness on a large dataset of unannotated clinical histories.

Results: A total of 50 186 clinical histories were included (794 training, 150 validation, 300 initial testing, 48 942 real-world application). Of the two open-source models, Mistral-7B outperformed Llama 2-7B in assessing completeness and was further fine-tuned. Both Mistral-7B and GPT-4 Turbo showed substantial overall agreement with radiologists (mean κ, 0.73 [95% CI: 0.67, 0.78] to 0.77 [95% CI: 0.71, 0.82]) and adjudicated annotations (mean BERTScore, 0.96 [95% CI: 0.96, 0.97] for both models; P = .38). Mistral-7B also rivaled GPT-4 Turbo in performance (weighted overall mean accuracy, 91% [95% CI: 89, 93] vs 92% [95% CI: 90, 94]; P = .31) despite being a smaller model. Using Mistral-7B, 26.2% (12 803 of 48 942) of unannotated clinical histories were found to contain all five elements.

Conclusion: An easily deployable fine-tuned open-source LLM (Mistral-7B), rivaling GPT-4 Turbo in performance, could effectively extract clinical history elements with substantial agreement with radiologists and produce a benchmark for completeness of a large sample of clinical histories. The model and code will be fully open-sourced.

© RSNA, 2025
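The extraction step described in Materials and Methods lends itself to a short illustration. Below is a minimal sketch of how an instruction-following LLM could be prompted, with one in-context example, to return the five elements as JSON. The prompt wording, the example history, and the extract_elements and is_complete helpers are hypothetical illustrations, not the study's actual pipeline; the GPT-4 Turbo call stands in for any of the adapted models.

```python
# A minimal sketch, assuming the OpenAI Python client (v1+) is installed and
# OPENAI_API_KEY is set. The prompt text, in-context example, and helper
# names are hypothetical illustrations, not the study's actual pipeline.
import json

from openai import OpenAI

client = OpenAI()

ELEMENTS = ["past medical history", "what", "when", "where", "clinical concern"]

SYSTEM_PROMPT = (
    "You extract elements from the clinical history accompanying an imaging "
    "order. Return a JSON object with exactly these keys: "
    + ", ".join(f'"{e}"' for e in ELEMENTS)
    + ". Use null for any element that is not present in the text."
)

# One in-context example (hypothetical) showing the expected output format.
FEW_SHOT = [
    {"role": "user",
     "content": "Hx nephrolithiasis. Left flank pain x2 days, eval for stone."},
    {"role": "assistant",
     "content": json.dumps({
         "past medical history": "nephrolithiasis",
         "what": "flank pain",
         "when": "2 days",
         "where": "left flank",
         "clinical concern": "kidney stone",
     })},
]

def extract_elements(history: str) -> dict:
    """Ask the model for the five elements of one clinical history."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  *FEW_SHOT,
                  {"role": "user", "content": history}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

def is_complete(elements: dict) -> bool:
    """A history counts as complete when all five elements were found."""
    return all(elements.get(e) for e in ELEMENTS)
```

With temperature set to 0 and a JSON response format, repeated runs on the same history should yield stable, machine-parseable output, which matters when scoring completeness over tens of thousands of orders.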
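The agreement and similarity metrics in the Results can likewise be sketched with standard libraries: scikit-learn implements Cohen κ, and the bert-score package implements BERTScore. The labels and texts below are hypothetical stand-ins for model outputs versus the adjudicated annotations, not data from the study.

```python
# A minimal sketch of the reported metrics, assuming scikit-learn and the
# bert-score package are installed. The labels and texts below are
# hypothetical stand-ins for model outputs vs adjudicated annotations.
from sklearn.metrics import cohen_kappa_score
from bert_score import score as bert_score

# Hypothetical per-history presence labels (1 = element found) for one
# element, comparing the model against a radiologist's annotation.
model_labels = [1, 1, 0, 1, 0, 1]
reader_labels = [1, 1, 0, 0, 0, 1]

kappa = cohen_kappa_score(model_labels, reader_labels)
print(f"Cohen kappa: {kappa:.2f}")  # 0.61-0.80 reads as 'substantial'

# Hypothetical extracted texts vs reference annotations. BERTScore
# compares contextual embeddings, so paraphrases can still score highly.
candidates = ["left flank pain", "fell off a ladder"]
references = ["pain in the left flank", "fall from a ladder"]

precision, recall, f1 = bert_score(candidates, references, lang="en")
print(f"Mean BERTScore F1: {f1.mean().item():.2f}")
```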

