IT-Infrastructure for Translational Medical Research, University of Augsburg, Alter Postweg 101, 86159 Augsburg, Germany.
J Biomed Inform. 2023 Sep;145:104478. doi: 10.1016/j.jbi.2023.104478. Epub 2023 Aug 23.
Obtaining text datasets with semantic annotations is an effortful process, yet crucial for supervised training in natural language processing (NLP). In general, developing and applying new NLP pipelines in domain-specific contexts requires custom-designed datasets to address NLP tasks in a supervised machine learning fashion. When operating in non-English languages for medical data processing, this exposes several interconnected problems, both minor and major, such as the lack of task-matching datasets as well as of task-specific pre-trained models. In our work, we propose leveraging pre-trained large language models for training data acquisition in order to retrieve sufficiently large datasets for training smaller and more efficient models for use-case-specific tasks. To demonstrate the effectiveness of our approach, we create a custom dataset that we use to train GPTNERMED, a medical NER model for German texts; our method nevertheless remains language-independent in principle. Our obtained dataset as well as our pre-trained models are publicly available at https://github.com/frankkramer-lab/GPTNERMED.
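The abstract outlines a two-step pipeline: first, use a pre-trained large language model to synthesize annotated training sentences; second, train a smaller, more efficient NER model on the resulting dataset. The Python sketch below illustrates a minimal version of the first step under stated assumptions: the OpenAI-style chat API call, the model name, the German prompt, the inline [Text](Label) markup, and the label set (Medikation, Dosis, Diagnose) are illustrative placeholders, not the authors' published prompts or exact annotation scheme.

```python
# Sketch: LLM-assisted generation of annotated German medical sentences.
# Assumptions (not taken from the paper): an OpenAI-style chat API, an
# inline [Text](Label) markup, and the label set below. The actual
# prompts and post-processing used for GPTNERMED may differ.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = {"Medikation", "Dosis", "Diagnose"}  # assumed label set

PROMPT = (
    "Schreibe einen kurzen deutschen medizinischen Beispielsatz und "
    "markiere Entitäten inline als [Text](Label) mit den Labels "
    "Medikation, Dosis und Diagnose."
)

def generate_annotated_sentence() -> str:
    """Ask the LLM for one synthetic, inline-annotated German sentence."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",           # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0.9,               # high temperature for lexical variety
    )
    return resp.choices[0].message.content.strip()

MARKUP = re.compile(r"\[([^\]]+)\]\((\w+)\)")

def parse_annotations(marked: str):
    """Convert inline [Text](Label) markup into (plain_text, char_spans)."""
    plain, entities = [], []
    cursor, offset = 0, 0
    for m in MARKUP.finditer(marked):
        plain.append(marked[cursor:m.start()])
        offset += m.start() - cursor
        surface, label = m.group(1), m.group(2)
        if label in LABELS:
            entities.append((offset, offset + len(surface), label))
        plain.append(surface)
        offset += len(surface)
        cursor = m.end()
    plain.append(marked[cursor:])
    return "".join(plain), entities

if __name__ == "__main__":
    text, spans = parse_annotations(generate_annotated_sentence())
    print(text, spans)  # character spans usable for NER training
```

Repeating the generation call with a varied prompt yields a corpus of (text, span) pairs; the second step, training a compact NER model on those character spans, can then follow any standard sequence-labeling recipe (e.g., a spaCy or transformers token classifier), as the paper's smaller downstream models do.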