

Annotated dataset creation through large language models for non-english medical NLP.

Affiliations

IT-Infrastructure for Translational Medical Research, University of Augsburg Alter Postweg 101, 86159 Augsburg, Germany.

Publication

J Biomed Inform. 2023 Sep;145:104478. doi: 10.1016/j.jbi.2023.104478. Epub 2023 Aug 23.


DOI: 10.1016/j.jbi.2023.104478
PMID: 37625508
Abstract

Obtaining text datasets with semantic annotations is an effortful process, yet crucial for supervised training in natural language processing (NLP). In general, developing and applying new NLP pipelines in domain-specific contexts often requires custom-designed datasets to address NLP tasks in a supervised machine learning fashion. When operating in non-English languages for medical data processing, this exposes several minor and major interconnected problems, such as the lack of task-matching datasets as well as task-specific pre-trained models. In our work, we suggest leveraging pre-trained large language models for training data acquisition in order to retrieve sufficiently large datasets for training smaller and more efficient models for use-case-specific tasks. To demonstrate the effectiveness of our approach, we create a custom dataset that we use to train a medical NER model for German texts, GPTNERMED, yet our method remains language-independent in principle. Our obtained dataset as well as our pre-trained models are publicly available at https://github.com/frankkramer-lab/GPTNERMED.

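The abstract describes a pipeline in which an LLM produces annotated sentences that are then used to train a smaller supervised NER model. A key data-preparation step in such a pipeline is converting character-span annotations (as an LLM would emit them) into token-level BIO tags. The sketch below is illustrative only, not the authors' code; the whitespace tokenizer, the example sentence, and the `Medikation` label are assumptions for demonstration.

```python
# Illustrative sketch (not from GPTNERMED): turn LLM-produced character-span
# annotations into token-level BIO tags, the format commonly used to train
# a smaller supervised NER model.

def spans_to_bio(text, spans):
    """text: sentence string; spans: list of (start, end, label) char offsets.
    Returns a list of (token, tag) pairs in BIO format."""
    # Naive whitespace tokenization, keeping character offsets per token.
    tokens, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        pos = end
        tokens.append((tok, start, end))
    # A token gets B- at the span start, I- inside a span, O otherwise.
    tagged = []
    for tok, start, end in tokens:
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:
                tag = ("B-" if start == s else "I-") + label
                break
        tagged.append((tok, tag))
    return tagged

# Hypothetical example: a German sentence with one annotated medication span.
sentence = "Der Patient erhielt 500 mg Ibuprofen"
annotations = [(20, 36, "Medikation")]  # "500 mg Ibuprofen"
print(spans_to_bio(sentence, annotations))
```

The resulting (token, tag) pairs can be fed directly into standard sequence-labeling trainers; a production pipeline would use a proper tokenizer that handles punctuation and subwords rather than whitespace splitting.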

Similar articles

[1]
Annotated dataset creation through large language models for non-english medical NLP.

J Biomed Inform. 2023-9

[2]
GERNERMED++: Semantic annotation in German medical NLP through transfer-learning, translation and word alignment.

J Biomed Inform. 2023-11

[3]
A comparison of word embeddings for the biomedical natural language processing.

J Biomed Inform. 2018-9-12

[4]
Biomedical and clinical English model packages for the Stanza Python NLP library.

J Am Med Inform Assoc. 2021-8-13

[5]
From zero to hero: Harnessing transformers for biomedical named entity recognition in zero- and few-shot contexts.

Artif Intell Med. 2024-10

[6]
Task definition, annotated dataset, and supervised natural language processing models for symptom extraction from unstructured clinical notes.

J Biomed Inform. 2020-2

[7]
Effects of data and entity ablation on multitask learning models for biomedical entity recognition.

J Biomed Inform. 2022-6

[8]
Transformers-sklearn: a toolkit for medical language understanding with transformer-based models.

BMC Med Inform Decis Mak. 2021-7-30

[9]
Local contrastive loss with pseudo-label based self-training for semi-supervised medical image segmentation.

Med Image Anal. 2023-7

[10]
Exploring the Latest Highlights in Medical Natural Language Processing across Multiple Languages: A Survey.

Yearb Med Inform. 2023-8

Cited by

[1]
Clinical document corpora-real ones, translated and synthetic substitutes, and assorted domain proxies: a survey of diversity in corpus design, with focus on German text data.

JAMIA Open. 2025-5-14

[2]
Preliminary assessment of large language models' performance in answering questions on developmental dysplasia of the hip.

J Child Orthop. 2025-4-15

[3]
Year 2023 in Biomedical Natural Language Processing: a Tribute to Large Language Models and Generative AI.

Yearb Med Inform. 2024-8

[4]
Leveraging large language models for knowledge-free weak supervision in clinical natural language processing.

Sci Rep. 2025-3-10

[5]
Deploying large language models for discourse studies: An exploration of automated analysis of media attitudes.

PLoS One. 2025-1-9

[6]
Open-source LLMs for text annotation: a practical guide for model setting and fine-tuning.

J Comput Soc Sci. 2025

[7]
Large Language Models in Biomedical and Health Informatics: A Review with Bibliometric Analysis.

J Healthc Inform Res. 2024-9-14

[8]
A GPT-based EHR modeling system for unsupervised novel disease detection.

J Biomed Inform. 2024-9

[9]
Leveraging Large Language Models for Knowledge-free Weak Supervision in Clinical Natural Language Processing.

Res Sq. 2024-6-28

[10]
Annotation-preserving machine translation of English corpora to validate Dutch clinical concept extraction tools.

J Am Med Inform Assoc. 2024-8-1
