

Using Synthetic Health Care Data to Leverage Large Language Models for Named Entity Recognition: Development and Validation Study.

Authors

Šuvalov Hendrik, Lepson Mihkel, Kukk Veronika, Malk Maria, Ilves Neeme, Kuulmets Hele-Andra, Kolde Raivo

Affiliations

Institute of Computer Science, University of Tartu, Tartu, Estonia.

Publication

J Med Internet Res. 2025 Mar 18;27:e66279. doi: 10.2196/66279.

Abstract

BACKGROUND

Named entity recognition (NER) plays a vital role in extracting critical medical entities from health care records, facilitating applications such as clinical decision support and data mining. Developing robust NER models for low-resource languages, such as Estonian, remains a challenge due to the scarcity of annotated data and domain-specific pretrained models. Large language models (LLMs) have proven to be promising in understanding text from any language or domain.

OBJECTIVE

This study addresses the development of medical NER models for low-resource languages, specifically Estonian. We propose a novel approach by generating synthetic health care data and using LLMs to annotate them. These synthetic data are then used to train a high-performing NER model, which is applied to real-world medical texts, preserving patient data privacy.

METHODS

Our approach to overcoming the shortage of annotated Estonian health care texts involves a three-step pipeline: (1) synthetic health care data are generated using a locally trained GPT-2 model on Estonian medical records, (2) the synthetic data are annotated with LLMs, specifically GPT-3.5-Turbo and GPT-4, and (3) the annotated synthetic data are then used to fine-tune an NER model, which is later tested on real-world medical data. This paper compares the performance of different prompts; assesses the impact of GPT-3.5-Turbo, GPT-4, and a local LLM; and explores the relationship between the amount of annotated synthetic data and model performance.
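Step (2) of the pipeline ultimately has to yield token-level labels that step (3) can fine-tune on. A minimal sketch of that conversion, assuming the annotating LLM returns character-level entity spans (the span format and the `DRUG` label below are illustrative assumptions, not details taken from the paper):

```python
# Convert character-level entity spans (as an annotating LLM might return
# them) into token-level BIO tags suitable for fine-tuning an NER model.

def spans_to_bio(text, spans):
    """spans: list of (start, end, label) character offsets into text."""
    tokens, offsets, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        tokens.append(tok)
        offsets.append((start, end))
        pos = end
    tags = ["O"] * len(tokens)
    for s, e, label in spans:
        inside = False
        for i, (ts, te) in enumerate(offsets):
            if ts >= s and te <= e:  # token lies fully inside the span
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return list(zip(tokens, tags))

# Toy example with one hypothetical drug mention ("ibuprofen 400 mg")
sent = "Patient was given ibuprofen 400 mg after the procedure"
ann = [(18, 34, "DRUG")]
print(spans_to_bio(sent, ann))
```

Real pipelines must also handle spans that cut across token boundaries and tokenizer subwords, but the BIO encoding itself is the common interface between LLM annotation output and standard NER fine-tuning.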

RESULTS

The proposed methodology demonstrates significant potential in extracting named entities from real-world medical texts. Our top-performing setup achieved an F-score of 0.69 for drug extraction and 0.38 for procedure extraction. These results indicate a strong performance in recognizing certain entity types while highlighting the complexity of extracting procedures.
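The F-scores reported above are the harmonic mean of precision and recall over extracted entities. As a reminder of how such a score relates to entity counts, a minimal sketch (the counts below are invented for illustration and are not taken from the study):

```python
def f1_score(tp, fp, fn):
    """F1: harmonic mean of precision and recall over entity matches."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only: 69 correct entities, 31 spurious, 31 missed
# gives precision = recall = 0.69, hence F1 = 0.69.
print(round(f1_score(69, 31, 31), 2))
```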

CONCLUSIONS

This paper demonstrates a successful approach to leveraging LLMs for training NER models using synthetic data, effectively preserving patient privacy. By avoiding reliance on human-annotated data, our method shows promise in developing models for low-resource languages, such as Estonian. Future work will focus on refining the synthetic data generation and expanding the method's applicability to other domains and languages.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4d84/11962312/dcad6271491c/jmir_v27i1e66279_fig1.jpg
