A metric learning-based method for biomedical entity linking.

Authors

Le Ngoc D, Nguyen Nhung T H

Affiliations

Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam.

Vietnam National University, Ho Chi Minh City, Vietnam.

Publication information

Front Res Metr Anal. 2023 Dec 19;8:1247094. doi: 10.3389/frma.2023.1247094. eCollection 2023.

Abstract

Biomedical entity linking is the task of mapping mention(s) that occur in a particular textual context to a unique concept in a knowledge base, e.g., the Unified Medical Language System (UMLS). One of the most challenging aspects of entity linking is the ambiguity of mentions, i.e., (1) mentions whose surface forms are very similar, but which map to different entities in different contexts, and (2) entities that can be expressed using diverse types of mentions. Recent studies have used BERT-based encoders to encode mentions and entities into distinguishable representations such that their similarity can be measured using distance metrics. However, most real-world biomedical datasets suffer from severe imbalance, i.e., some classes have many instances while others appear only once or are completely absent from the training data. A common way to address this issue is to down-sample the dataset, i.e., to reduce the number of instances of the majority classes to make the dataset more balanced. In the context of entity linking, however, down-sampling reduces the model's ability to learn comprehensive representations of mentions across different contexts, which is essential to the task. To tackle this issue, we propose a metric-based learning method that treats a given entity and its mentions as a whole, regardless of the number of mentions in the training set. Specifically, our method uses a triplet loss-based function in conjunction with a clustering technique to learn the representations of mentions and entities. Through evaluations on two challenging biomedical datasets, i.e., MedMentions and BC5CDR, we show that our proposed method is able to address the issue of imbalanced data and to perform competitively with other state-of-the-art models. Moreover, our method significantly reduces computational cost in both the training and inference steps. Our source code is publicly available.
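
For readers unfamiliar with the terms, the sketch below illustrates a generic triplet loss and nearest-entity lookup over embeddings. It is not the authors' implementation: the abstract does not specify the distance function, the margin, or the clustering procedure, so the Euclidean distance, the margin of 1.0, the toy 2-D vectors, the placeholder identifiers, and the function names `euclidean`, `triplet_loss`, and `link_mention` are all assumptions made for illustration.

```python
import math
from typing import Dict, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed value, not taken from the paper
) -> float:
    """Generic triplet loss: max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


def link_mention(mention: Sequence[float], entities: Dict[str, Sequence[float]]) -> str:
    """Return the id of the entity whose embedding is closest to the mention."""
    return min(entities, key=lambda entity_id: euclidean(mention, entities[entity_id]))


# Toy 2-D embeddings, invented for illustration only.
mention_vec = [0.0, 0.0]
entity_vecs = {
    "ENTITY_A": [0.0, 1.0],  # hypothetical gold entity for the mention
    "ENTITY_B": [3.0, 4.0],  # hypothetical unrelated entity
}

# The gold entity is already closer than the other entity by more than the margin.
loss_satisfied = triplet_loss(mention_vec, entity_vecs["ENTITY_A"], entity_vecs["ENTITY_B"])

# Swapping the roles gives a violated triplet and therefore a positive loss.
loss_violated = triplet_loss(mention_vec, entity_vecs["ENTITY_B"], entity_vecs["ENTITY_A"])

predicted = link_mention(mention_vec, entity_vecs)
```

In this toy setup the first triplet contributes no loss because the mention already sits closer to its gold entity than to the other entity by more than the margin, while the swapped triplet is penalized. At inference time, linking reduces to a nearest-neighbor search over entity embeddings.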

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9bd5/10762861/3f5a1b5df553/frma-08-1247094-g0001.jpg
