Chi Jonathan, Rouphail Yazan, Hillis Ethan, Ma Ningning, Nguyen An, Wang Jane, Hofford Mackenzie, Gupta Aditi, Lyons Patrick G, Wilcox Adam, Lai Albert M, Payne Philip R O, Kollef Marin H, Dreisbach Caitlin, Michelson Andrew P
Goergen Institute for Data Science and Artificial Intelligence, University of Rochester, Rochester, NY 14627, United States.
Department of Medicine, Institute for Informatics, Data Science and Biostatistics, Washington University in St. Louis, St. Louis, MO 63110, United States.
JAMIA Open. 2025 Aug 13;8(4):ooaf092. doi: 10.1093/jamiaopen/ooaf092. eCollection 2025 Aug.
Large language models (LLMs) have demonstrated high levels of performance in clinical information extraction compared to rule-based systems and traditional machine-learning approaches, offering scalability, contextualization, and easier deployment. However, most studies rely on proprietary models with privacy concerns and high costs, limiting accessibility. We aim to evaluate 14 publicly available open-source LLMs for extracting clinically relevant findings from free-text echocardiogram reports and examine the feasibility of their implementation in information extraction workflows.
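The extraction workflow described above could be sketched as prompting a locally hosted open-source model to return structured findings. The example below assumes an Ollama-style local API; the entity names and prompt wording are illustrative assumptions, not the authors' exact pipeline.

```python
import json

# Illustrative subset of findings to extract (the study used 9 entities;
# these two names are assumptions for the sketch).
ENTITIES = ["left ventricular ejection fraction", "aortic stenosis severity"]

def build_request(report_text: str, model: str = "gemma2:9b-instruct") -> str:
    """Build a JSON request body for a local Ollama-style /api/generate call."""
    prompt = (
        "Extract the following findings from the echocardiogram report below. "
        "Answer as JSON with one key per finding; use null if absent:\n"
        + "\n".join(f"- {e}" for e in ENTITIES)
        + "\n\nReport:\n"
        + report_text
    )
    return json.dumps({"model": model, "prompt": prompt, "stream": False})

# The resulting string would be POSTed to the local model server.
payload = build_request("LVEF is estimated at 55%. No aortic stenosis.")
```

Running one request per report across all 507 reports and 14 models would yield the per-model outputs that are then scored against the adjudicated annotations.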
We used 14 open-source LLMs to extract clinically relevant entities from echocardiogram reports (n = 507). Each report was manually annotated by 2 independent health-care professionals and adjudicated by a third. Lexical variance and length of each echocardiogram report were collected. Precision, recall, and F1 scores were calculated for the 9 extracted entities via multiclass classification.
In aggregate, Gemma2:9b-instruct had the highest precision, recall, and F1 scores at 0.973 (0.962-0.983), 0.959 (0.947-0.973), and 0.965 (0.951-0.975), respectively. In comparison, Phi3:3.8b-mini-instruct had the lowest precision score at 0.831 (0.804-0.856), while Gemma:7b-instruct had the lowest recall and F1 scores at 0.382 (0.356-0.408) and 0.392 (0.356-0.428), respectively.
Using LLMs for entity extraction from echocardiogram reports has the potential to support both clinical research and health-care delivery. Our work demonstrates the feasibility of using open-source models for more efficient computation and extraction.