Large language models for data extraction from unstructured and semi-structured electronic health records: a multiple model performance evaluation.

Authors

Ntinopoulos Vasileios, Rodriguez Cetina Biefer Hector, Tudorache Igor, Papadopoulos Nestoras, Odavic Dragan, Risteski Petar, Haeussler Achim, Dzemali Omer

Affiliations

Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland.

Department of Cardiac Surgery, Municipal Hospital of Zurich - Triemli, Zurich, Switzerland.

Publication information

BMJ Health Care Inform. 2025 Jan 19;32(1):e101139. doi: 10.1136/bmjhci-2024-101139.

DOI:10.1136/bmjhci-2024-101139
PMID:39832824
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11751965/
Abstract

OBJECTIVES

We aimed to evaluate the performance of multiple large language models (LLMs) in data extraction from unstructured and semi-structured electronic health records.

METHODS

50 synthetic medical notes in English, containing a structured and an unstructured part, were drafted and evaluated by domain experts, and subsequently used for LLM-prompting. 18 LLMs were evaluated against a baseline transformer-based model. Performance assessment comprised four entity extraction and five binary classification tasks with a total of 450 predictions for each LLM. LLM-response consistency assessment was performed over three same-prompt iterations.
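The prediction count in the methods follows from 50 notes times (4 entity extraction + 5 binary classification) tasks. The sketch below illustrates that arithmetic and one plausible way to score overall accuracy. The abstract does not state the exact scoring rule, so the exact-match comparison and the names `overall_accuracy`, `predictions`, and `gold` are assumptions made for illustration only.

```python
# Counts taken from the abstract.
N_NOTES = 50
N_ENTITY_TASKS = 4
N_BINARY_TASKS = 5

# Each note is scored once per task.
total_predictions = N_NOTES * (N_ENTITY_TASKS + N_BINARY_TASKS)


def overall_accuracy(predictions, gold):
    """Return the fraction of (note, task) cells where prediction equals gold.

    Both arguments are dicts keyed by (note_id, task_name).
    Exact-match scoring is an assumption, not the paper's stated rule.
    """
    if predictions.keys() != gold.keys():
        raise ValueError("predictions and gold must cover the same cells")
    correct = sum(1 for key in gold if predictions[key] == gold[key])
    return correct / len(gold)


if __name__ == "__main__":
    print(total_predictions)
```

With real data, `gold` would hold the expert-validated answers for all 450 cells and `predictions` the parsed model responses for the same cells.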

RESULTS

Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat-bison and Llama 3-70b exhibited an excellent overall accuracy >0.98 (0.995, 0.988, 0.988, 0.988, 0.986, 0.982, 0.982, and 0.982, respectively), significantly higher than the baseline RoBERTa model (0.742). Claude 2.0, Claude 2.1, Claude 3.0 Opus, PaLM 2 chat-bison, GPT 4, Claude 3.0 Sonnet and Llama 3-70b showed a marginally higher and Gemini Advanced a marginally lower multiple-run consistency than the baseline model RoBERTa (Krippendorff's alpha value 1, 0.998, 0.996, 0.996, 0.992, 0.991, 0.989, 0.988, and 0.985, respectively).
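The consistency figures above are Krippendorff's alpha values. As a rough illustration of how such a value can be computed, the sketch below treats each repeated run as a "coder" and each prediction as a unit, using the standard coincidence-matrix formulation for nominal labels. The abstract does not say which level of measurement or software was used, so the nominal metric and the function name `krippendorff_alpha_nominal` are assumptions.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists; each inner list holds the labels that the
    repeated runs assigned to one item (None marks a missing label).
    """
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # a single label cannot be paired with anything
        # Every ordered pair of labels from different runs counts 1/(m-1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the grand total of pairable values.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("not enough pairable labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label occurs")
    return 1.0 - observed / expected
```

Under this formulation, a model that returns identical answers across all three runs for every item scores 1, and any disagreement between runs pulls the value below 1.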

DISCUSSION

Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat-bison and Llama 3-70b performed the best, exhibiting outstanding performance in both entity extraction and binary classification, with highly consistent responses over multiple same-prompt iterations. Their use could leverage data for research and unburden healthcare professionals. Real-data analyses are warranted to confirm their performance in a real-world setting.

CONCLUSION

Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat-bison and Llama 3-70b seem to be able to reliably extract data from unstructured and semi-structured electronic health records. Further analyses using real data are warranted to confirm their performance in a real-world setting.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/541e/11751965/6e6b2a5fdfdc/bmjhci-32-1-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/541e/11751965/1f5a809b754e/bmjhci-32-1-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/541e/11751965/85509ac31cc1/bmjhci-32-1-g002.jpg
