Šuvalov Hendrik, Lepson Mihkel, Kukk Veronika, Malk Maria, Ilves Neeme, Kuulmets Hele-Andra, Kolde Raivo
Institute of Computer Science, University of Tartu, Tartu, Estonia.
J Med Internet Res. 2025 Mar 18;27:e66279. doi: 10.2196/66279.
Named entity recognition (NER) plays a vital role in extracting critical medical entities from health care records, facilitating applications such as clinical decision support and data mining. Developing robust NER models for low-resource languages, such as Estonian, remains a challenge due to the scarcity of annotated data and domain-specific pretrained models. Large language models (LLMs) have shown promise in understanding text from any language or domain.
This study addresses the development of medical NER models for low-resource languages, specifically Estonian. We propose a novel approach by generating synthetic health care data and using LLMs to annotate them. These synthetic data are then used to train a high-performing NER model, which is applied to real-world medical texts, preserving patient data privacy.
Our approach to overcoming the shortage of annotated Estonian health care texts involves a three-step pipeline: (1) synthetic health care data are generated using a locally trained GPT-2 model on Estonian medical records, (2) the synthetic data are annotated with LLMs, specifically GPT-3.5-Turbo and GPT-4, and (3) the annotated synthetic data are then used to fine-tune an NER model, which is later tested on real-world medical data. This paper compares the performance of different prompts; assesses the impact of GPT-3.5-Turbo, GPT-4, and a local LLM; and explores the relationship between the amount of annotated synthetic data and model performance.
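A central piece of glue between steps 2 and 3 of such a pipeline is converting the LLM's entity annotations into token-level labels an NER model can be fine-tuned on. The sketch below is a hypothetical illustration of that conversion into the common BIO scheme; the function name, the example sentence, and the `DRUG` label are invented for illustration and do not reproduce the paper's actual prompts or label set.

```python
# Hypothetical sketch: map LLM-produced (entity text, label) pairs onto
# token-level BIO tags, the format typically used to fine-tune NER models.

def to_bio(tokens, entities):
    """Return one BIO tag per token, given (entity_text, label) pairs."""
    tags = ["O"] * len(tokens)
    for ent_text, label in entities:
        ent_tokens = ent_text.split()
        n = len(ent_tokens)
        # Find the first case-insensitive token-span match for the entity.
        for i in range(len(tokens) - n + 1):
            if [t.lower() for t in tokens[i:i + n]] == [t.lower() for t in ent_tokens]:
                tags[i] = f"B-{label}"          # entity start
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"      # entity continuation
                break
    return tags

# Invented Estonian-like example: "Patsiendile määrati ibuprofeen 400 mg"
tokens = ["Patsiendile", "määrati", "ibuprofeen", "400", "mg"]
entities = [("ibuprofeen 400 mg", "DRUG")]
print(to_bio(tokens, entities))
# → ['O', 'O', 'B-DRUG', 'I-DRUG', 'I-DRUG']
```

In practice the matching step must also handle entities the LLM paraphrases or hallucinates, which is one reason annotation quality differs between GPT-3.5-Turbo and GPT-4.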
The proposed methodology demonstrates significant potential in extracting named entities from real-world medical texts. Our top-performing setup achieved an F-score of 0.69 for drug extraction and 0.38 for procedure extraction. These results indicate strong performance in recognizing certain entity types while highlighting the complexity of extracting procedures.
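For readers less familiar with the metric, the reported F-scores are the standard F1, the harmonic mean of precision and recall. The precision and recall values in the check below are illustrative only and are not taken from the paper.

```python
# Standard F1 (harmonic mean of precision and recall), as used to report
# drug- and procedure-extraction performance.

def f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values only: one of many precision/recall pairs yielding ~0.69.
print(round(f1(0.75, 0.64), 2))  # → 0.69
```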
This paper demonstrates a successful approach to leveraging LLMs for training NER models using synthetic data, effectively preserving patient privacy. By avoiding reliance on human-annotated data, our method shows promise in developing models for low-resource languages, such as Estonian. Future work will focus on refining the synthetic data generation and expanding the method's applicability to other domains and languages.