Borchert Florian, Llorca Ignacio, Roller Roland, Arnrich Bert, Schapranow Matthieu-P
Hasso Plattner Institute for Digital Engineering, University of Potsdam, Potsdam 14482, Germany.
Speech and Language Technology Lab, German Research Center for Artificial Intelligence (DFKI), Berlin 10559, Germany.
JAMIA Open. 2024 Dec 26;8(1):ooae147. doi: 10.1093/jamiaopen/ooae147. eCollection 2025 Feb.
To improve the performance of medical entity normalization across many languages, especially those with fewer language resources than English.
We propose xMEN, a modular system for cross-lingual (x) medical entity normalization (MEN), accommodating both low- and high-resource scenarios. To account for the scarcity of aliases for many target languages and terminologies, we leverage multilingual aliases via cross-lingual candidate generation. For candidate ranking, we incorporate a trainable cross-encoder (CE) model if annotations for the target task are available. To balance the output of general-purpose candidate generators with subsequent trainable re-rankers, we introduce a novel rank regularization term in the loss function for training CEs. For re-ranking without gold-standard annotations, we introduce multiple new weakly labeled datasets using machine translation and projection of annotations from a high-resource language.
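The rank regularization idea can be sketched in plain Python. This is a minimal, hypothetical illustration, not the actual xMEN implementation: assume candidates arrive already sorted by the candidate generator's score, and the cross-encoder (CE) training loss combines a standard softmax cross-entropy over candidates with a penalty (weighted by a hypothetical coefficient `lam`) that discourages the re-ranker from inverting the generator's original ordering.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of raw scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def cross_entropy(ce_scores, gold_idx):
    # Listwise loss: negative log-probability of the gold candidate.
    return -math.log(softmax(ce_scores)[gold_idx])


def rank_regularizer(ce_scores):
    # Hypothetical regularization term: candidates are assumed to be
    # ordered by the candidate generator, so penalize each adjacent
    # inversion (a later candidate scored above an earlier one) with
    # a hinge on the score difference.
    penalty = 0.0
    for i in range(len(ce_scores) - 1):
        penalty += max(0.0, ce_scores[i + 1] - ce_scores[i])
    return penalty


def regularized_loss(ce_scores, gold_idx, lam=0.1):
    # Total training loss for one mention: cross-entropy plus the
    # rank regularization term, balanced by lam (an assumed knob).
    return cross_entropy(ce_scores, gold_idx) + lam * rank_regularizer(ce_scores)
```

For example, CE scores that preserve the generator's ordering (`[2.0, 1.0, 0.5]`) incur no regularization penalty, while a fully inverted ordering (`[0.5, 1.0, 2.0]`) is penalized even when the cross-entropy term is unchanged. This reflects the stated goal of balancing a general-purpose candidate generator against a trainable re-ranker.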
xMEN improves the state-of-the-art performance across various benchmark datasets for several European languages. Weakly supervised CEs are effective when no training data is available for the target task.
We perform an analysis of normalization errors, revealing that complex entities are still challenging to normalize. New modules and benchmark datasets can be easily integrated in the future.
xMEN exhibits strong performance for medical entity normalization in many languages, even when no labeled data and few terminology aliases for the target language are available. To enable reproducible benchmarks in the future, we make the system available as an open-source Python toolkit. The pre-trained models and source code are available online: https://github.com/hpi-dhc/xmen.