

Large language models for data extraction from unstructured and semi-structured electronic health records: a multiple model performance evaluation.

Author Information

Ntinopoulos Vasileios, Rodriguez Cetina Biefer Hector, Tudorache Igor, Papadopoulos Nestoras, Odavic Dragan, Risteski Petar, Haeussler Achim, Dzemali Omer

Affiliations

Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland.

Department of Cardiac Surgery, Municipal Hospital of Zurich - Triemli, Zurich, Switzerland.

Publication Information

BMJ Health Care Inform. 2025 Jan 19;32(1):e101139. doi: 10.1136/bmjhci-2024-101139.

Abstract

OBJECTIVES

We aimed to evaluate the performance of multiple large language models (LLMs) in data extraction from unstructured and semi-structured electronic health records.

METHODS

Fifty synthetic medical notes in English, each containing a structured and an unstructured part, were drafted and evaluated by domain experts and subsequently used for LLM prompting. Eighteen LLMs were evaluated against a baseline transformer-based model. Performance assessment comprised four entity-extraction and five binary-classification tasks, totalling 450 predictions per LLM. LLM response consistency was assessed over three same-prompt iterations.
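The scoring scheme above (50 notes × 9 tasks = 450 exact-match predictions per model) can be sketched as follows; the function name and the dict-based data layout are illustrative assumptions, not taken from the paper:

```python
def overall_accuracy(predictions, gold):
    """Exact-match accuracy over all (note, task) pairs.

    predictions, gold: dicts keyed by (note_id, task_id) -> answer string.
    With 50 notes and 9 tasks per note, this covers 450 predictions,
    matching the protocol described in the abstract.
    """
    assert predictions.keys() == gold.keys()
    hits = sum(predictions[k] == gold[k] for k in gold)
    return hits / len(gold)
```

Entity-extraction answers would be strings and binary-classification answers yes/no labels; both reduce to exact-match comparison under this scheme.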

RESULTS

Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat-bison and Llama 3-70b exhibited excellent overall accuracy >0.98 (0.995, 0.988, 0.988, 0.988, 0.986, 0.982, 0.982 and 0.982, respectively), significantly higher than the baseline RoBERTa model (0.742). Claude 2.0, Claude 2.1, Claude 3.0 Opus, PaLM 2 chat-bison, GPT 4, Claude 3.0 Sonnet and Llama 3-70b showed marginally higher multiple-run consistency (Krippendorff's alpha 1, 0.998, 0.996, 0.996, 0.992, 0.991 and 0.989, respectively), and Gemini Advanced marginally lower (0.985), than the baseline RoBERTa model (0.988).
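Multiple-run consistency was quantified with Krippendorff's alpha. For nominal data with no missing ratings, where each item's "raters" are the three same-prompt iterations, it can be computed as in this minimal sketch (not the authors' implementation):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for nominal data, complete ratings.

    ratings: list of lists; ratings[i] holds the values assigned to
    item i by each run (e.g. three same-prompt iterations).
    """
    # Coincidence matrix: every ordered pair of values within an item,
    # weighted by 1 / (m_u - 1) for an item rated m_u times.
    o = Counter()
    for item in ratings:
        m = len(item)
        if m < 2:
            continue  # a single rating carries no pairing information
        for a, b in permutations(item, 2):
            o[(a, b)] += 1.0 / (m - 1)
    # Marginal value frequencies and total mass.
    n_c = Counter()
    for (a, _b), w in o.items():
        n_c[a] += w
    n = sum(n_c.values())
    # Observed disagreement: off-diagonal mass of the coincidence matrix.
    d_o = sum(w for (a, b), w in o.items() if a != b) / n
    # Expected disagreement under chance pairing of values.
    d_e = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n * (n - 1))
    return 1.0 if d_e == 0 else 1.0 - d_o / d_e
```

Identical responses across all three iterations yield alpha = 1, matching the perfect consistency reported for Claude 2.0 above.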

DISCUSSION

Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat-bison and Llama 3-70b performed best, exhibiting outstanding performance in both entity extraction and binary classification, with highly consistent responses over multiple same-prompt iterations. Their use could make electronic health record data more readily available for research and reduce the documentation burden on healthcare professionals. Analyses on real data are warranted to confirm their performance in a real-world setting.

CONCLUSION

Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat-bison and Llama 3-70b seem to be able to reliably extract data from unstructured and semi-structured electronic health records. Further analyses using real data are warranted to confirm their performance in a real-world setting.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/541e/11751965/1f5a809b754e/bmjhci-32-1-g001.jpg
