Garcia-Carmona Angel Manuel, Prieto Maria-Lorena, Puertas Enrique, Beunza Juan-Jose
Research and Doctorate School, Universidad Europea de Madrid, Madrid, Spain.
Department of Computing and Technology, Universidad Europea de Madrid, Madrid, Spain.
JMIR AI. 2025 Jul 3;4:e68776. doi: 10.2196/68776.
The digital transformation of health care has introduced both opportunities and challenges, particularly in managing and analyzing the vast amounts of unstructured medical data generated daily. There is a need to explore the feasibility of generative solutions for extracting data from medical reports according to specific criteria.
This study aimed to investigate the application of large language models (LLMs) for the automated extraction of structured information from unstructured medical reports, using the LangChain framework in Python.
Through a systematic evaluation of leading LLMs (GPT-4o, Llama 3, Llama 3.1, Gemma 2, Qwen 2, and Qwen 2.5) using zero-shot prompting techniques and embedding results into a vector database, this study assessed the performance of LLMs in extracting patient demographics, diagnostic details, and pharmacological data.
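The abstract does not report implementation details, but a minimal sketch of how such a zero-shot extraction step could be wired up with LangChain in Python is shown below. The prompt wording, the field schema (patient_name, age, diagnosis, medications), and the choice of GPT-4o as the backing model are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch (assumptions): zero-shot extraction of structured fields
# from an unstructured medical report with LangChain. The schema and prompt
# are illustrative; the paper's actual configuration may differ.
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate


class PatientRecord(BaseModel):
    """Target schema for the extracted information."""
    patient_name: str | None = Field(None, description="Full name of the patient")
    age: int | None = Field(None, description="Patient age in years")
    diagnosis: str | None = Field(None, description="Primary diagnosis")
    medications: list[str] = Field(default_factory=list, description="Prescribed drugs")


# Zero-shot prompt: instructions only, no worked examples.
prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract the requested fields from the medical report. "
               "Return null for any field that is not mentioned."),
    ("human", "{report}"),
])

llm = ChatOpenAI(model="gpt-4o", temperature=0)
extractor = prompt | llm.with_structured_output(PatientRecord)

record = extractor.invoke({"report": "72-year-old male admitted with ..."})
print(record.model_dump())
```

Swapping the model for Llama 3, Gemma 2, or Qwen 2 would only require changing the chat model class (for example, to a locally served model), since the prompt and schema stay the same.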
Evaluation metrics, including accuracy, precision, recall, and F-score, revealed high efficacy across most categories, with GPT-4o achieving the highest overall performance (91.4% accuracy).
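The exact scoring protocol is not given in the abstract; the sketch below assumes exact-match scoring of each extracted field against a gold annotation and shows how the reported metrics can be computed with scikit-learn on toy labels.

```python
# Schematic evaluation (assumption: exact-match scoring per extracted field).
# y_true marks whether the gold annotation expects the field; y_pred marks
# whether the model's extraction matched it. Values are toy examples.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 1, 0, 1, 1, 0, 1, 1]
y_pred = [1, 1, 0, 1, 0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```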
The findings highlight notable differences in precision and recall between models, particularly in extracting names and age-related information. Challenges remained in processing unstructured medical text, including variability in model performance across data types. Our findings demonstrate the feasibility of integrating LLMs into health care workflows: LLMs offer substantial improvements in data accessibility and support clinical decision-making processes. In addition, the paper describes the role of retrieval-augmented generation (RAG) techniques in improving information retrieval accuracy and in addressing issues such as hallucinations and outdated data in LLM outputs. Future work should explore optimization through larger and more diverse training datasets, advanced prompting strategies, and the integration of domain-specific knowledge to improve model generalizability and precision.
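To make the RAG component concrete, the sketch below embeds report texts into a vector index and passes the retrieved excerpts to the LLM as context. The abstract does not name the vector database or embedding model used; FAISS and OpenAI embeddings are stand-ins for illustration, and the sample reports and question are invented.

```python
# Minimal retrieval-augmented generation sketch (assumptions): reports are
# embedded into a FAISS index and the top matches are supplied to the LLM as
# grounding context. Vector store and embedding model are illustrative choices.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

reports = [
    "Report 1: 65-year-old female, type 2 diabetes, metformin 850 mg ...",
    "Report 2: 48-year-old male, hypertension, enalapril 10 mg ...",
]

# Embed the report texts and build the vector index.
vector_store = FAISS.from_texts(reports, OpenAIEmbeddings())
retriever = vector_store.as_retriever(search_kwargs={"k": 2})

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the retrieved report excerpts:\n{context}"),
    ("human", "{question}"),
])
llm = ChatOpenAI(model="gpt-4o", temperature=0)

question = "Which medication was prescribed for the hypertensive patient?"
context = "\n\n".join(d.page_content for d in retriever.invoke(question))
answer = (prompt | llm).invoke({"context": context, "question": question})
print(answer.content)
```

Grounding the generation step in retrieved report text is what limits hallucinations and keeps answers tied to the stored records rather than to the model's parametric knowledge.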