Computer Science, Humboldt-Universität zu Berlin, Berlin 12489, Germany.
Bioinformatics. 2024 Aug 2;40(8). doi: 10.1093/bioinformatics/btae474.
Biomedical entity linking (BEL) is the task of grounding entity mentions to a given knowledge base (KB). Recently, neural name-based methods, i.e. systems that identify the most appropriate name in the KB for a given mention using a neural network (either via dense retrieval or autoregressive modeling), have achieved remarkable results on the task without requiring manual tuning or the definition of domain/entity-specific rules. However, because name-based methods directly return KB names, they cannot cope with homonyms, i.e. different KB entities sharing the exact same name. This significantly degrades their performance for KBs where homonyms account for a large share of entity mentions (e.g. UMLS and NCBI Gene).
We present BELHD (Biomedical Entity Linking with Homonym Disambiguation), a new name-based method that copes with this challenge. BELHD builds upon the BioSyn model with two crucial extensions. First, it preprocesses the KB, expanding each homonym with a specifically constructed disambiguating string and thus enforcing unique linking decisions. Second, it introduces candidate sharing, a novel strategy that strengthens the overall training signal by including similar mentions from the same document as positive or negative examples, according to their corresponding KB identifier. Experiments with 10 corpora and 5 entity types show that BELHD improves upon current neural state-of-the-art approaches, achieving the best results in 6 out of 10 corpora with an average improvement of 4.55 percentage points in recall@1. Furthermore, the KB preprocessing is orthogonal to the prediction model and can therefore also improve other neural methods, which we exemplify for GenBioEL, a generative name-based BEL approach.
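To illustrate the idea of KB preprocessing for homonyms, the following is a minimal sketch and not the BELHD implementation. The abstract does not specify how the disambiguating string is constructed, so the rule used here (append a name that is unique to the entity, falling back to its identifier) is an assumption, as are the function name `expand_homonyms`, the toy KB, and its identifiers.

```python
from collections import defaultdict


def expand_homonyms(kb):
    """Map every KB name to exactly one identifier.

    kb: dict of identifier -> list of names (hypothetical toy format).
    Names shared by several identifiers (homonyms) are expanded with a
    disambiguating string. ASSUMPTION: the string is the first name that
    is unique to the entity, or the identifier if no such name exists.
    """
    # Collect, for every name, the set of identifiers that use it.
    name_to_ids = defaultdict(set)
    for identifier, names in kb.items():
        for name in names:
            name_to_ids[name].add(identifier)

    unique_names = {}
    for identifier, names in kb.items():
        for name in names:
            if len(name_to_ids[name]) == 1:
                # Not a homonym: the name already identifies one entity.
                unique_names[name] = identifier
                continue
            # Homonym: pick a hint that belongs to this entity only.
            hint = next(
                (n for n in names if len(name_to_ids[n]) == 1),
                identifier,
            )
            unique_names[f"{name} ({hint})"] = identifier
    return unique_names


# Toy KB with made-up identifiers; "cold" is a homonym.
toy_kb = {
    "ID:1": ["cold", "common cold"],
    "ID:2": ["cold", "cold temperature"],
}
expanded = expand_homonyms(toy_kb)
```

After expansion, every name in `expanded` maps to a single identifier, so a name-based model that returns a KB name also returns an unambiguous linking decision.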
The code to reproduce our experiments can be found at: https://github.com/sg-wbi/belhd.