Ntinopoulos Vasileios, Rodriguez Cetina Biefer Hector, Tudorache Igor, Papadopoulos Nestoras, Odavic Dragan, Risteski Petar, Haeussler Achim, Dzemali Omer
Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland.
Department of Cardiac Surgery, Municipal Hospital of Zurich - Triemli, Zurich, Switzerland.
BMJ Health Care Inform. 2025 Jan 19;32(1):e101139. doi: 10.1136/bmjhci-2024-101139.
We aimed to evaluate the performance of multiple large language models (LLMs) in data extraction from unstructured and semi-structured electronic health records.
50 synthetic medical notes in English, each containing a structured and an unstructured part, were drafted and evaluated by domain experts and subsequently used for LLM prompting. 18 LLMs were evaluated against a baseline transformer-based model. Performance assessment comprised four entity-extraction and five binary-classification tasks, totalling 450 predictions per LLM. LLM response consistency was assessed over three same-prompt iterations.
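The evaluation design described above (50 notes × 9 tasks = 450 predictions per model) reduces to scoring each prediction against the expert gold standard. A minimal sketch of such a metric, assuming exact-match scoring and hypothetical key names (this is not the authors' code), might look like:

```python
# Hypothetical sketch of the overall-accuracy metric: 50 notes x 9 tasks
# (4 entity-extraction + 5 binary-classification) = 450 predictions per
# model, scored here as exact match against expert gold labels.
def overall_accuracy(gold, pred):
    """gold, pred: dicts mapping (note_id, task) -> label or extracted value."""
    if gold.keys() != pred.keys():
        raise ValueError("gold and pred must cover the same (note, task) pairs")
    correct = sum(pred[key] == gold[key] for key in gold)
    return correct / len(gold)
```

In practice, entity-extraction answers may need normalisation (casing, whitespace, units) before an exact-match comparison is meaningful.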
Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat-bison and Llama 3-70b exhibited excellent overall accuracy >0.98 (0.995, 0.988, 0.988, 0.988, 0.986, 0.982, 0.982, and 0.982, respectively), significantly higher than the baseline RoBERTa model (0.742). Claude 2.0, Claude 2.1, Claude 3.0 Opus, PaLM 2 chat-bison, GPT 4, Claude 3.0 Sonnet and Llama 3-70b showed marginally higher, and Gemini Advanced marginally lower, multiple-run consistency than the baseline RoBERTa model (Krippendorff's alpha values of 1, 0.998, 0.996, 0.996, 0.992, 0.991, 0.989, 0.988, and 0.985, respectively).
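The consistency figures above are Krippendorff's alpha computed over three same-prompt iterations. A minimal sketch of alpha for nominal labels with no missing ratings (a standard formulation, not the authors' code) could be:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(runs):
    """Krippendorff's alpha for nominal labels, assuming no missing data.

    runs: list of equal-length label lists, one per same-prompt iteration.
    """
    m = len(runs)                        # number of iterations ("coders")
    o = Counter()                        # coincidence matrix o[(c, k)]
    for labels in zip(*runs):            # one tuple of m labels per prediction
        for a, b in permutations(labels, 2):
            o[(a, b)] += 1 / (m - 1)
    n = sum(o.values())                  # total number of pairable values
    marginals = Counter()
    for (a, _), w in o.items():
        marginals[a] += w
    observed_disagreement = sum(w for (a, b), w in o.items() if a != b)
    expected_pairs = sum(marginals[a] * marginals[b]
                         for a in marginals for b in marginals if a != b)
    if expected_pairs == 0:              # every label identical everywhere
        return 1.0
    return 1 - (n - 1) * observed_disagreement / expected_pairs
```

Three identical runs yield alpha = 1; values in the reported 0.985 to 1 range indicate near-perfect response consistency across iterations.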
Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat-bison and Llama 3-70b performed best, exhibiting outstanding performance in both entity extraction and binary classification, with highly consistent responses over multiple same-prompt iterations. Their use could leverage health-record data for research and unburden healthcare professionals. Analyses on real data are warranted to confirm their performance in a real-world setting.
Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat-bison and Llama 3-70b seem to be able to reliably extract data from unstructured and semi-structured electronic health records. Further analyses using real data are warranted to confirm their performance in a real-world setting.