Idrissi-Yaghir Ahmad, Arzideh Kamyar, Schäfer Henning, Eryilmaz Bahadir, Bahn Mikel, Wen Yutong, Borys Katarzyna, Hartmann Eva, Schmidt Cynthia, Pelka Obioma, Haubold Johannes, Friedrich Christoph M, Nensa Felix, Hosch René
Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Essen, Germany.
Institute for Artificial Intelligence in Medicine, University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany, 49 201 - 72377817.
J Med Internet Res. 2025 Aug 12;27:e73540. doi: 10.2196/73540.
Recent breakthroughs in natural language processing, particularly the emergence of large language models (LLMs), have demonstrated remarkable capabilities on general knowledge benchmarks. However, there are limited data on how well these models perform on and understand the Fast Healthcare Interoperability Resources (FHIR) standard. The complexity and specialized nature of FHIR present challenges for LLMs, which are typically trained on broad datasets and may have a limited understanding of the nuances required for domain-specific tasks. Improving health data interoperability can greatly benefit the use of clinical data and interaction with electronic health records.
This study presents the FHIR Workbench, a comprehensive suite of datasets designed to evaluate the ability of LLMs to understand and apply the FHIR standard.
In total, 4 evaluation datasets were created to assess the FHIR knowledge and capabilities of LLMs. The tasks comprise multiple-choice questions on general FHIR concepts and on the FHIR Representational State Transfer (REST) application programming interface, as well as identifying the resource type of a given FHIR resource and generating FHIR resources from unstructured clinical patient notes. In addition, we evaluated open-source LLMs, such as Qwen 2.5 Coder and DeepSeek-V3, and commercial LLMs, including GPT-4o and Gemini 2, on these tasks in a zero-shot setting. To provide context for interpreting LLM performance, 6 recruited participants with varying levels of FHIR expertise completed a subset of the datasets.
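To illustrate the kind of input involved in the resource-type identification task, the following minimal sketch shows a FHIR-style JSON resource and how its declared type can be read from the `resourceType` element. The example resource and the helper function are illustrative assumptions, not items from the evaluation datasets.

```python
import json

# A minimal FHIR Patient resource (illustrative example, not taken
# from the FHIR-ResourceID dataset). Every FHIR JSON resource
# declares its type in the "resourceType" element.
patient_resource = json.dumps({
    "resourceType": "Patient",
    "id": "example",
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "birthDate": "1980-05-12",
})

def identify_resource_type(resource_json: str) -> str:
    """Return the FHIR resource type declared in a JSON resource."""
    return json.loads(resource_json)["resourceType"]

print(identify_resource_type(patient_resource))  # Patient
```

In the benchmark, models receive the serialized resource as text and must produce the correct type label; the ground truth is exactly this declared `resourceType` value.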
Our evaluation across multiple FHIR tasks revealed clear differences in performance. Commercial models demonstrated exceptional capabilities, with GPT-4o achieving an F1-score of 0.9990 on the FHIR-ResourceID task, 0.9400 on the FHIR-QA task, and 0.9267 on the FHIR-RESTQA task. Open-source models also performed strongly, with DeepSeek-V3 achieving 0.9400 on FHIR-QA, 0.9400 on FHIR-RESTQA, and 0.9142 on FHIR-ResourceID. Qwen 2.5 Coder-7B-Instruct demonstrated high accuracy, scoring 0.9533 on FHIR-QA and 0.8920 on FHIR-ResourceID. However, all models struggled with the Note2FHIR task, with performance ranging from 0.0382 (OLMo) to a maximum of 0.3633 (GPT-4.5-preview), highlighting the significant challenge of converting unstructured clinical text into FHIR-compliant resources. Human participants achieved accuracy scores ranging from 0.50 to 1.0 across the first 3 tasks.
This study highlights the competitive performance of both open-source models, such as Qwen and DeepSeek, and commercial models, such as GPT-4o and Gemini, in FHIR-related tasks. While open-source models are advancing rapidly, commercial models still have an advantage for specific, complex tasks. The FHIR Workbench offers a valuable platform for evaluating the capabilities of these models and promoting improvements in health data interoperability.