
German Medical Named Entity Recognition Model and Data Set Creation Using Machine Translation and Word Alignment: Algorithm Development and Validation.

Authors

Frei Johann, Kramer Frank

Affiliation

IT Infrastructure for Translational Medical Research, University of Augsburg, Augsburg, Germany.

Publication

JMIR Form Res. 2023 Feb 28;7:e39077. doi: 10.2196/39077.

DOI: 10.2196/39077
PMID: 36853741
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10015355/
Abstract

BACKGROUND

Data mining in the field of medical data analysis often needs to rely solely on the processing of unstructured data to retrieve relevant data. For German natural language processing, few open medical neural named entity recognition (NER) models have been published before this work. A major issue can be attributed to the lack of German training data.

OBJECTIVE

We developed a synthetic data set and a novel German medical NER model for public access to demonstrate the feasibility of our approach. In order to bypass legal restrictions due to potential data leaks through model analysis, we did not make use of internal, proprietary data sets, which is a frequent veto factor for data set publication.

METHODS

The underlying German data set was retrieved by translation and word alignment of a public English data set. The data set served as a foundation for model training and evaluation. For demonstration purposes, our NER model follows a simple network architecture that is designed for low computational requirements.
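
To make the translation-and-alignment step more concrete, the following is a minimal sketch of how entity annotations might be projected from an English sentence onto its German translation once a word alignment is available. It is an illustration under stated assumptions, not the authors' pipeline: the function name `project_spans`, the example sentence, the hand-written alignment pairs, and the labels (`Strength`, `Drug`, `Frequency`) are all hypothetical, and a real system would obtain the translation and alignment from dedicated tools.

```python
from typing import Dict, List, Tuple

# A span is (start_token, end_token_exclusive, label) over a token list.
Span = Tuple[int, int, str]


def project_spans(spans: List[Span], alignment: List[Tuple[int, int]]) -> List[Span]:
    """Project token-level spans from a source sentence onto a target sentence.

    `alignment` is a list of (source_index, target_index) pairs, as a word
    aligner might produce. Hypothetical helper for illustration only.
    """
    src_to_tgt: Dict[int, List[int]] = {}
    for src_idx, tgt_idx in alignment:
        src_to_tgt.setdefault(src_idx, []).append(tgt_idx)

    projected: List[Span] = []
    for start, end, label in spans:
        # Collect every target token aligned to any source token in the span.
        targets = [t for s in range(start, end) for t in src_to_tgt.get(s, [])]
        if not targets:
            continue  # no aligned target tokens: the annotation is dropped
        # Cover the aligned tokens with one contiguous target span.
        projected.append((min(targets), max(targets) + 1, label))
    return projected


# Hypothetical example sentence pair (not taken from the paper's data set).
en_tokens = ["The", "patient", "takes", "20", "mg", "aspirin", "daily"]
de_tokens = ["Der", "Patient", "nimmt", "täglich", "20", "mg", "Aspirin"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 4), (4, 5), (5, 6), (6, 3)]
en_spans = [(3, 5, "Strength"), (5, 6, "Drug"), (6, 7, "Frequency")]

de_spans = project_spans(en_spans, alignment)
for start, end, label in de_spans:
    print(label, de_tokens[start:end])
# Strength ['20', 'mg']
# Drug ['Aspirin']
# Frequency ['täglich']
```

In this sketch a span whose aligned target tokens are not adjacent is widened to cover everything in between, and a span with no aligned tokens is silently dropped. Both behaviors are simplifications that could introduce the kind of alignment artifacts a synthesized data set may contain.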

RESULTS

The obtained data set consisted of 8599 sentences including 30,233 annotations. The model achieved a class frequency-averaged F score of 0.82 on the test set after training across 7 different NER types. Artifacts in the synthesized data set with regard to translation and alignment induced by the proposed method were exposed. The annotation performance was evaluated on an external data set and measured in comparison with an existing baseline model that has been trained on a dedicated German data set in a traditional fashion. We discussed the drop in annotation performance on an external data set for our simple NER model. Our model is publicly available.
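
As a rough illustration of the reported metric, the sketch below computes a support-weighted average of per-class F1 scores. This assumes that "class frequency-averaged F score" means each class's F1 is weighted by the number of gold annotations of that class; the abstract does not spell out the formula, so this reading is an assumption. The function name `weighted_f1`, the class names, and the counts are hypothetical and unrelated to the paper's results.

```python
from typing import Dict, Tuple


def weighted_f1(per_class: Dict[str, Tuple[int, int, int]]) -> float:
    """Support-weighted mean of per-class F1.

    `per_class` maps a label to (true_positives, false_positives,
    false_negatives). Support is the number of gold items: tp + fn.
    """
    total_support = sum(tp + fn for tp, _fp, fn in per_class.values())
    if total_support == 0:
        return 0.0

    score = 0.0
    for tp, fp, fn in per_class.values():
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        score += f1 * (tp + fn) / total_support
    return score


# Hypothetical counts for two classes.
counts = {"Drug": (8, 2, 2), "Dose": (1, 1, 1)}
print(round(weighted_f1(counts), 4))  # 0.75
```

Here the frequent class (F1 of 0.8, support 10) dominates the rare one (F1 of 0.5, support 2), which is why a frequency-weighted average can look better than an unweighted macro average when rare classes perform poorly.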

CONCLUSIONS

We demonstrated the feasibility of obtaining a data set and training a German medical NER model by the exclusive use of public training data through our suggested method. The discussion on the limitations of our approach includes ways to further mitigate remaining problems in future work.
