Borchert Florian, Llorca Ignacio, Roller Roland, Arnrich Bert, Schapranow Matthieu-P
Hasso Plattner Institute for Digital Engineering, University of Potsdam, Potsdam 14482, Germany.
Speech and Language Technology Lab, German Research Center for Artificial Intelligence (DFKI), Berlin 10559, Germany.
JAMIA Open. 2024 Dec 26;8(1):ooae147. doi: 10.1093/jamiaopen/ooae147. eCollection 2025 Feb.
To improve the performance of medical entity normalization across many languages, especially those with fewer language resources than English.
We propose xMEN, a modular system for cross-lingual (x) medical entity normalization (MEN), accommodating both low- and high-resource scenarios. To account for the scarcity of aliases for many target languages and terminologies, we leverage multilingual aliases via cross-lingual candidate generation. For candidate ranking, we incorporate a trainable cross-encoder (CE) model if annotations for the target task are available. To balance the output of general-purpose candidate generators with subsequent trainable re-rankers, we introduce a novel rank regularization term in the loss function for training CEs. For re-ranking without gold-standard annotations, we introduce multiple new weakly labeled datasets using machine translation and projection of annotations from a high-resource language.
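The rank regularization idea can be sketched in plain Python. This is a minimal, hypothetical illustration, not the actual xMEN implementation: assume candidates arrive already sorted by the candidate generator's score, and the cross-encoder (CE) training loss combines a standard softmax cross-entropy over candidates with a penalty (weighted by a hypothetical coefficient `lam`) that discourages the re-ranker from inverting the generator's original ordering.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of raw scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def cross_entropy(ce_scores, gold_idx):
    # Listwise loss: negative log-probability of the gold candidate.
    return -math.log(softmax(ce_scores)[gold_idx])


def rank_regularizer(ce_scores):
    # Hypothetical regularization term: candidates are assumed to be
    # ordered by the candidate generator, so penalize each adjacent
    # inversion (a later candidate scored above an earlier one) with
    # a hinge on the score difference.
    penalty = 0.0
    for i in range(len(ce_scores) - 1):
        penalty += max(0.0, ce_scores[i + 1] - ce_scores[i])
    return penalty


def regularized_loss(ce_scores, gold_idx, lam=0.1):
    # Total training loss for one mention: cross-entropy plus the
    # rank regularization term, balanced by lam (an assumed knob).
    return cross_entropy(ce_scores, gold_idx) + lam * rank_regularizer(ce_scores)
```

For example, CE scores that preserve the generator's ordering (`[2.0, 1.0, 0.5]`) incur no regularization penalty, while a fully inverted ordering (`[0.5, 1.0, 2.0]`) is penalized even when the cross-entropy term is unchanged. This reflects the stated goal of balancing a general-purpose candidate generator against a trainable re-ranker.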
xMEN improves the state-of-the-art performance across various benchmark datasets for several European languages. Weakly supervised CEs are effective when no training data is available for the target task.
We perform an analysis of normalization errors, revealing that complex entities are still challenging to normalize. New modules and benchmark datasets can be easily integrated in the future.
xMEN exhibits strong performance for medical entity normalization in many languages, even when no labeled data and few terminology aliases for the target language are available. To enable reproducible benchmarks in the future, we make the system available as an open-source Python toolkit. The pre-trained models and source code are available online: https://github.com/hpi-dhc/xmen.