German Medical Named Entity Recognition Model and Data Set Creation Using Machine Translation and Word Alignment: Algorithm Development and Validation.

Author information

Frei Johann, Kramer Frank

Affiliation

IT Infrastructure for Translational Medical Research, University of Augsburg, Augsburg, Germany.

Publication information

JMIR Form Res. 2023 Feb 28;7:e39077. doi: 10.2196/39077.

DOI:10.2196/39077

PMID:36853741

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10015355/

Abstract

BACKGROUND

Data mining in the field of medical data analysis often needs to rely solely on the processing of unstructured data to retrieve relevant data. For German natural language processing, few open medical neural named entity recognition (NER) models have been published before this work. A major issue can be attributed to the lack of German training data.

OBJECTIVE

We developed a synthetic data set and a novel German medical NER model for public access to demonstrate the feasibility of our approach. In order to bypass legal restrictions due to potential data leaks through model analysis, we did not make use of internal, proprietary data sets, which is a frequent veto factor for data set publication.

METHODS

The underlying German data set was retrieved by translation and word alignment of a public English data set. The data set served as a foundation for model training and evaluation. For demonstration purposes, our NER model follows a simple network architecture that is designed for low computational requirements.
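To illustrate how annotations can be carried from an English sentence to its German translation through word alignment, here is a minimal sketch. It is not the authors' implementation: the function name `project_spans`, the token-level span format, the example sentence, and the entity labels are all assumptions made for illustration, and the alignment pairs are assumed to come from some external word aligner.

```python
from typing import List, Tuple

# Hypothetical span format: (start_token, end_token_exclusive, label)
Span = Tuple[int, int, str]
# Alignment pair: (source_token_index, target_token_index)
AlignmentPair = Tuple[int, int]


def project_spans(spans: List[Span], alignment: List[AlignmentPair]) -> List[Span]:
    """Project source-side entity spans onto the target sentence.

    Each source span is mapped to the smallest contiguous target span that
    covers every target token aligned to a token inside the source span.
    Spans without any aligned target token are dropped.
    """
    projected: List[Span] = []
    for start, end, label in spans:
        targets = [t for s, t in alignment if start <= s < end]
        if not targets:
            continue  # nothing aligned, so the annotation cannot be projected
        projected.append((min(targets), max(targets) + 1, label))
    return projected


# Illustrative example (sentence, labels, and alignment are made up).
english = ["The", "patient", "received", "5", "mg", "aspirin", "daily"]
german = ["Der", "Patient", "erhielt", "täglich", "5", "mg", "Aspirin"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 4), (4, 5), (5, 6), (6, 3)]
english_spans = [(3, 5, "Strength"), (5, 6, "Drug"), (6, 7, "Frequency")]

for start, end, label in project_spans(english_spans, alignment):
    print(label, german[start:end])
```

Taking the covering span is a deliberate simplification. When the aligner links an entity to non-adjacent target tokens, the projected span also absorbs the tokens in between, which is one plausible source of alignment artifacts of the kind the abstract mentions.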

RESULTS

The obtained data set consisted of 8599 sentences including 30,233 annotations. The model achieved a class frequency-averaged F score of 0.82 on the test set after training across 7 different NER types. Artifacts in the synthesized data set with regard to translation and alignment induced by the proposed method were exposed. The annotation performance was evaluated on an external data set and measured in comparison with an existing baseline model that has been trained on a dedicated German data set in a traditional fashion. We discussed the drop in annotation performance on an external data set for our simple NER model. Our model is publicly available.
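As a reading aid for the "class frequency-averaged F score", the sketch below computes a support-weighted average of per-class F1 scores. Interpreting the metric this way is an assumption, since the abstract does not give the formula. The function name `weighted_f1` and the counts in the example are hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple

# Hypothetical per-class counts: label -> (true positives, false positives, false negatives)
Counts = Dict[str, Tuple[int, int, int]]


def weighted_f1(per_class: Counts) -> float:
    """Average per-class F1 scores, weighting each class by its support.

    Support is the number of gold annotations of a class (tp + fn), so
    frequent classes contribute more to the final score than rare ones.
    """
    total_support = 0
    weighted_sum = 0.0
    for tp, fp, fn in per_class.values():
        support = tp + fn
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        weighted_sum += support * f1
        total_support += support
    return weighted_sum / total_support if total_support else 0.0


# Made-up counts for two classes, for illustration only.
example_counts = {"Drug": (8, 2, 2), "Strength": (3, 1, 1)}
print(round(weighted_f1(example_counts), 4))
```

With this weighting, a class with many annotations dominates the score, so a high overall value can hide weak performance on rare entity types.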

CONCLUSIONS

We demonstrated the feasibility of obtaining a data set and training a German medical NER model by the exclusive use of public training data through our suggested method. The discussion on the limitations of our approach includes ways to further mitigate remaining problems in future work.
