

Using Synthetic Health Care Data to Leverage Large Language Models for Named Entity Recognition: Development and Validation Study.

Authors

Šuvalov Hendrik, Lepson Mihkel, Kukk Veronika, Malk Maria, Ilves Neeme, Kuulmets Hele-Andra, Kolde Raivo

Affiliations

Institute of Computer Science, University of Tartu, Tartu, Estonia.

Publication

J Med Internet Res. 2025 Mar 18;27:e66279. doi: 10.2196/66279.

Abstract

BACKGROUND

Named entity recognition (NER) plays a vital role in extracting critical medical entities from health care records, facilitating applications such as clinical decision support and data mining. Developing robust NER models for low-resource languages, such as Estonian, remains a challenge due to the scarcity of annotated data and domain-specific pretrained models. Large language models (LLMs) have proven to be promising in understanding text from any language or domain.

OBJECTIVE

This study addresses the development of medical NER models for low-resource languages, specifically Estonian. We propose a novel approach by generating synthetic health care data and using LLMs to annotate them. These synthetic data are then used to train a high-performing NER model, which is applied to real-world medical texts, preserving patient data privacy.

METHODS

Our approach to overcoming the shortage of annotated Estonian health care texts involves a three-step pipeline: (1) synthetic health care data are generated using a locally trained GPT-2 model on Estonian medical records, (2) the synthetic data are annotated with LLMs, specifically GPT-3.5-Turbo and GPT-4, and (3) the annotated synthetic data are then used to fine-tune an NER model, which is later tested on real-world medical data. This paper compares the performance of different prompts; assesses the impact of GPT-3.5-Turbo, GPT-4, and a local LLM; and explores the relationship between the amount of annotated synthetic data and model performance.
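Step (2) of the pipeline ultimately has to yield token-level labels that step (3) can fine-tune on. A minimal sketch of that conversion, assuming the annotating LLM returns character-level entity spans (the span format and the `DRUG` label below are illustrative assumptions, not details taken from the paper):

```python
# Convert character-level entity spans (as an annotating LLM might return
# them) into token-level BIO tags suitable for fine-tuning an NER model.

def spans_to_bio(text, spans):
    """spans: list of (start, end, label) character offsets into text."""
    tokens, offsets, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        tokens.append(tok)
        offsets.append((start, end))
        pos = end
    tags = ["O"] * len(tokens)
    for s, e, label in spans:
        inside = False
        for i, (ts, te) in enumerate(offsets):
            if ts >= s and te <= e:  # token lies fully inside the span
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return list(zip(tokens, tags))

# Toy example with one hypothetical drug mention ("ibuprofen 400 mg")
sent = "Patient was given ibuprofen 400 mg after the procedure"
ann = [(18, 34, "DRUG")]
print(spans_to_bio(sent, ann))
```

Real pipelines must also handle spans that cut across token boundaries and tokenizer subwords, but the BIO encoding itself is the common interface between LLM annotation output and standard NER fine-tuning.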

RESULTS

The proposed methodology demonstrates significant potential in extracting named entities from real-world medical texts. Our top-performing setup achieved an F-score of 0.69 for drug extraction and 0.38 for procedure extraction. These results indicate a strong performance in recognizing certain entity types while highlighting the complexity of extracting procedures.
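The F-scores reported above are the harmonic mean of precision and recall over extracted entities. As a reminder of how such a score relates to entity counts, a minimal sketch (the counts below are invented for illustration and are not taken from the study):

```python
def f1_score(tp, fp, fn):
    """F1: harmonic mean of precision and recall over entity matches."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only: 69 correct entities, 31 spurious, 31 missed
# gives precision = recall = 0.69, hence F1 = 0.69.
print(round(f1_score(69, 31, 31), 2))
```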

CONCLUSIONS

This paper demonstrates a successful approach to leveraging LLMs for training NER models using synthetic data, effectively preserving patient privacy. By avoiding reliance on human-annotated data, our method shows promise in developing models for low-resource languages, such as Estonian. Future work will focus on refining the synthetic data generation and expanding the method's applicability to other domains and languages.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4d84/11962312/dcad6271491c/jmir_v27i1e66279_fig1.jpg
