

Annotated dataset creation through large language models for non-english medical NLP.

Affiliation

IT-Infrastructure for Translational Medical Research, University of Augsburg, Alter Postweg 101, 86159 Augsburg, Germany.

Publication

J Biomed Inform. 2023 Sep;145:104478. doi: 10.1016/j.jbi.2023.104478. Epub 2023 Aug 23.

DOI: 10.1016/j.jbi.2023.104478
PMID: 37625508
Abstract

Obtaining text datasets with semantic annotations is an effortful process, yet crucial for supervised training in natural language processing (NLP). In general, developing and applying new NLP pipelines in domain-specific contexts for tasks often requires custom-designed datasets to address NLP tasks in a supervised machine learning fashion. When operating in non-English languages for medical data processing, this exposes several minor and major, interconnected problems such as the lack of task-matching datasets as well as task-specific pre-trained models. In our work, we suggest to leverage pre-trained large language models for training data acquisition in order to retrieve sufficiently large datasets for training smaller and more efficient models for use-case-specific tasks. To demonstrate the effectiveness of our approach, we create a custom dataset that we use to train a medical NER model for German texts, GPTNERMED, yet our method remains language-independent in principle. Our obtained dataset as well as our pre-trained models are publicly available at https://github.com/frankkramer-lab/GPTNERMED.
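The pipeline the abstract describes — prompting an LLM to produce synthetic sentences with entity markup, then converting that markup into span annotations for training a smaller NER model — can be sketched as follows. This is a minimal illustration, not the paper's actual prompt protocol: the inline tag format and the MEDICATION/DOSAGE labels are assumptions for demonstration.

```python
# Sketch of LLM-generated annotation parsing for NER training data.
# Assumption: the LLM is prompted to return sentences with inline tags
# like [LABEL]entity[/LABEL]; we convert these to character-span
# annotations (start, end, label) over the untagged text.
import re

TAG_RE = re.compile(r"\[(?P<label>[A-Z]+)\](?P<text>.*?)\[/(?P=label)\]")

def parse_annotated(sentence: str):
    """Turn inline-tagged text into (plain_text, [(start, end, label)])."""
    spans, plain = [], []
    cursor = 0   # position in the tagged input
    offset = 0   # length of plain text emitted so far
    for m in TAG_RE.finditer(sentence):
        plain.append(sentence[cursor:m.start()])
        offset += m.start() - cursor
        entity = m.group("text")
        spans.append((offset, offset + len(entity), m.group("label")))
        plain.append(entity)
        offset += len(entity)
        cursor = m.end()
    plain.append(sentence[cursor:])
    return "".join(plain), spans

# A synthetic German sentence as an LLM might return it (hypothetical labels).
llm_output = ("Der Patient erhielt [MEDICATION]Ibuprofen[/MEDICATION] "
              "[DOSAGE]400 mg[/DOSAGE] täglich.")
text, spans = parse_annotated(llm_output)
# text  -> "Der Patient erhielt Ibuprofen 400 mg täglich."
# spans -> [(20, 29, 'MEDICATION'), (30, 36, 'DOSAGE')]
```

The resulting (text, spans) pairs could then feed a standard span-based NER trainer (e.g. a spaCy or transformer token-classification setup) — the "smaller and more efficient" downstream model the authors refer to.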


Similar Articles

1. Annotated dataset creation through large language models for non-english medical NLP.
J Biomed Inform. 2023 Sep;145:104478. doi: 10.1016/j.jbi.2023.104478. Epub 2023 Aug 23.

2. GERNERMED++: Semantic annotation in German medical NLP through transfer-learning, translation and word alignment.
J Biomed Inform. 2023 Nov;147:104513. doi: 10.1016/j.jbi.2023.104513. Epub 2023 Oct 13.

3. A comparison of word embeddings for the biomedical natural language processing.
J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.

4. Biomedical and clinical English model packages for the Stanza Python NLP library.
J Am Med Inform Assoc. 2021 Aug 13;28(9):1892-1899. doi: 10.1093/jamia/ocab090.

5. From zero to hero: Harnessing transformers for biomedical named entity recognition in zero- and few-shot contexts.
Artif Intell Med. 2024 Oct;156:102970. doi: 10.1016/j.artmed.2024.102970. Epub 2024 Aug 24.

6. Task definition, annotated dataset, and supervised natural language processing models for symptom extraction from unstructured clinical notes.
J Biomed Inform. 2020 Feb;102:103354. doi: 10.1016/j.jbi.2019.103354. Epub 2019 Dec 12.

7. Effects of data and entity ablation on multitask learning models for biomedical entity recognition.
J Biomed Inform. 2022 Jun;130:104062. doi: 10.1016/j.jbi.2022.104062. Epub 2022 Apr 9.

8. Transformers-sklearn: a toolkit for medical language understanding with transformer-based models.
BMC Med Inform Decis Mak. 2021 Jul 30;21(Suppl 2):90. doi: 10.1186/s12911-021-01459-0.

9. Local contrastive loss with pseudo-label based self-training for semi-supervised medical image segmentation.
Med Image Anal. 2023 Jul;87:102792. doi: 10.1016/j.media.2023.102792. Epub 2023 Mar 11.

10. Exploring the Latest Highlights in Medical Natural Language Processing across Multiple Languages: A Survey.
Yearb Med Inform. 2023 Aug;32(1):230-243. doi: 10.1055/s-0043-1768726. Epub 2023 Dec 26.

Cited By

1. Clinical document corpora-real ones, translated and synthetic substitutes, and assorted domain proxies: a survey of diversity in corpus design, with focus on German text data.
JAMIA Open. 2025 May 14;8(3):ooaf024. doi: 10.1093/jamiaopen/ooaf024. eCollection 2025 Jun.

2. Preliminary assessment of large language models' performance in answering questions on developmental dysplasia of the hip.
J Child Orthop. 2025 Apr 15:18632521251331772. doi: 10.1177/18632521251331772.

3. Year 2023 in Biomedical Natural Language Processing: a Tribute to Large Language Models and Generative AI.
Yearb Med Inform. 2024 Aug;33(1):241-248. doi: 10.1055/s-0044-1800751. Epub 2025 Apr 8.

4. Leveraging large language models for knowledge-free weak supervision in clinical natural language processing.
Sci Rep. 2025 Mar 10;15(1):8241. doi: 10.1038/s41598-024-68168-2.

5. Deploying large language models for discourse studies: An exploration of automated analysis of media attitudes.
PLoS One. 2025 Jan 9;20(1):e0313932. doi: 10.1371/journal.pone.0313932. eCollection 2025.

6. Open-source LLMs for text annotation: a practical guide for model setting and fine-tuning.
J Comput Soc Sci. 2025;8(1):17. doi: 10.1007/s42001-024-00345-9. Epub 2024 Dec 18.

7. Large Language Models in Biomedical and Health Informatics: A Review with Bibliometric Analysis.
J Healthc Inform Res. 2024 Sep 14;8(4):658-711. doi: 10.1007/s41666-024-00171-8. eCollection 2024 Dec.

8. A GPT-based EHR modeling system for unsupervised novel disease detection.
J Biomed Inform. 2024 Sep;157:104706. doi: 10.1016/j.jbi.2024.104706. Epub 2024 Aug 8.

9. Leveraging Large Language Models for Knowledge-free Weak Supervision in Clinical Natural Language Processing.
Res Sq. 2024 Jun 28:rs.3.rs-4559971. doi: 10.21203/rs.3.rs-4559971/v1.

10. Annotation-preserving machine translation of English corpora to validate Dutch clinical concept extraction tools.
J Am Med Inform Assoc. 2024 Aug 1;31(8):1725-1734. doi: 10.1093/jamia/ocae159.