IT-Infrastructure for Translational Medical Research, University of Augsburg, Alter Postweg 101, 86159 Augsburg, Germany.
J Biomed Inform. 2023 Sep;145:104478. doi: 10.1016/j.jbi.2023.104478. Epub 2023 Aug 23.
Obtaining text datasets with semantic annotations is an effortful process, yet crucial for supervised training in natural language processing (NLP). In general, developing and applying new NLP pipelines in domain-specific contexts requires custom-designed datasets to address NLP tasks in a supervised machine learning fashion. When operating in non-English languages for medical data processing, this exposes several interconnected problems, both minor and major, such as the lack of task-matching datasets as well as of task-specific pre-trained models. In our work, we propose leveraging pre-trained large language models for training data acquisition in order to retrieve sufficiently large datasets for training smaller and more efficient models for use-case-specific tasks. To demonstrate the effectiveness of our approach, we create a custom dataset that we use to train GPTNERMED, a medical NER model for German texts; our method nevertheless remains language-independent in principle. Our obtained dataset as well as our pre-trained models are publicly available at https://github.com/frankkramer-lab/GPTNERMED.
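The abstract outlines a two-step pipeline: first, use a pre-trained large language model to synthesize annotated training sentences; second, train a smaller, more efficient NER model on the resulting dataset. The Python sketch below illustrates a minimal version of the first step under stated assumptions: the OpenAI-style chat API call, the model name, the German prompt, the inline [Text](Label) markup, and the label set (Medikation, Dosis, Diagnose) are illustrative placeholders, not the authors' published prompts or exact annotation scheme.

```python
# Sketch: LLM-assisted generation of annotated German medical sentences.
# Assumptions (not taken from the paper): an OpenAI-style chat API, an
# inline [Text](Label) markup, and the label set below. The actual
# prompts and post-processing used for GPTNERMED may differ.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = {"Medikation", "Dosis", "Diagnose"}  # assumed label set

PROMPT = (
    "Schreibe einen kurzen deutschen medizinischen Beispielsatz und "
    "markiere Entitäten inline als [Text](Label) mit den Labels "
    "Medikation, Dosis und Diagnose."
)

def generate_annotated_sentence() -> str:
    """Ask the LLM for one synthetic, inline-annotated German sentence."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",           # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0.9,               # high temperature for lexical variety
    )
    return resp.choices[0].message.content.strip()

MARKUP = re.compile(r"\[([^\]]+)\]\((\w+)\)")

def parse_annotations(marked: str):
    """Convert inline [Text](Label) markup into (plain_text, char_spans)."""
    plain, entities = [], []
    cursor, offset = 0, 0
    for m in MARKUP.finditer(marked):
        plain.append(marked[cursor:m.start()])
        offset += m.start() - cursor
        surface, label = m.group(1), m.group(2)
        if label in LABELS:
            entities.append((offset, offset + len(surface), label))
        plain.append(surface)
        offset += len(surface)
        cursor = m.end()
    plain.append(marked[cursor:])
    return "".join(plain), entities

if __name__ == "__main__":
    text, spans = parse_annotations(generate_annotated_sentence())
    print(text, spans)  # character spans usable for NER training
```

Repeating the generation call with a varied prompt yields a corpus of (text, span) pairs; the second step, training a compact NER model on those character spans, can then follow any standard sequence-labeling recipe (e.g., a spaCy or transformers token classifier), as the paper's smaller downstream models do.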