Istituto Dalle Molle di Studi sull'Intelligenza Artificiale USI/SUPSI, Lugano, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland.
Istituto Dalle Molle di Studi sull'Intelligenza Artificiale USI/SUPSI, Lugano, Switzerland; Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland.
J Biomed Inform. 2021 Oct;122:103893. doi: 10.1016/j.jbi.2021.103893. Epub 2021 Sep 2.
Entity relation extraction plays an important role in biomedical, healthcare, and clinical research. Recently, pre-trained models based on transformer architectures and their variants have shown remarkable performance in various natural language processing tasks. Most of these variants rely on slight modifications to architectural components or representation schemes, or on augmenting data using distant supervision methods. In distantly supervised methods, one of the main challenges is pruning out noisy samples. A similar situation can arise when the training samples are not directly available but need to be constructed from the given dataset. The BioCreative V Chemical Disease Relation (CDR) task provides a dataset that does not explicitly offer mention-level gold annotations and hence replicates the above scenario. Selecting the representative sentences from the given abstract or document text that could convey a potential entity relationship therefore becomes essential. Most existing methods in the literature consider either the entire text or all the sentences that contain the entity mentions, which can be computationally expensive and time-consuming. This paper presents a novel approach to handle such scenarios, specifically in biomedical relation extraction. We propose utilizing Shortest Dependency Path (SDP) features to construct data samples by pruning out noisy information and selecting the most representative samples for model learning. We also utilize triplet information in model learning using the biomedical variant of BERT, viz., BioBERT. The problem is represented as a sentence pair classification task using the sentence and the entity-relation pair as input. We analyze the approach on both intra-sentential and inter-sentential relations in the CDR dataset. The proposed approach, which utilizes the SDP and triplet features, presents promising results, specifically on the inter-sentential relation extraction task.
We make the code used for this work publicly available on GitHub.
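To make the SDP idea concrete, the following is a minimal sketch, not the authors' released code. It finds the shortest dependency path between a chemical and a disease mention by breadth-first search over an undirected dependency tree, then pairs the path tokens with an entity-relation triplet as the two segments of a sentence pair input. The toy sentence, its hand-written `heads` array, the function names, and the relation word "induces" are all assumptions made for illustration; the abstract does not specify how the SDP tokens and the triplet are serialized.

```python
from collections import deque


def shortest_dependency_path(heads, start, end):
    """Return the token indices on the shortest path between start and end.

    heads[i] is the index of token i's syntactic head; -1 marks the root.
    The dependency tree is treated as an undirected graph.
    """
    # Build an undirected adjacency list from the head array.
    adjacency = {i: set() for i in range(len(heads))}
    for child, head in enumerate(heads):
        if head >= 0:
            adjacency[child].add(head)
            adjacency[head].add(child)

    # Breadth-first search, remembering each node's predecessor.
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            break
        for neighbour in sorted(adjacency[node]):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)

    if end not in previous:
        return []  # no path, e.g. the mentions sit in disconnected parses

    # Walk back from end to start, then reverse.
    path = []
    node = end
    while node is not None:
        path.append(node)
        node = previous[node]
    return path[::-1]


def build_sentence_pair(tokens, heads, chem_idx, dis_idx, relation="induces"):
    """Build (segment A, segment B) for a sentence pair classifier.

    Assumption: segment A holds the SDP tokens and segment B the triplet;
    the relation word is a hypothetical placeholder.
    """
    path = shortest_dependency_path(heads, chem_idx, dis_idx)
    sdp_text = " ".join(tokens[i] for i in path)
    triplet_text = f"{tokens[chem_idx]} {relation} {tokens[dis_idx]}"
    return sdp_text, triplet_text


# Hand-annotated toy parse (an assumption, not the output of a real parser).
tokens = ["Lithium", "treatment", "frequently", "caused",
          "severe", "tremor", "in", "patients"]
heads = [1, 3, 3, -1, 5, 3, 7, 3]

sdp_pair = build_sentence_pair(tokens, heads, chem_idx=0, dis_idx=5)
print(sdp_pair)
```

In this toy case the path keeps only "Lithium treatment caused tremor" and drops the modifiers "frequently", "severe", and "in patients", which is the kind of noise pruning the abstract describes. In practice the `heads` array would come from a dependency parser, and the two segments would be passed to a BioBERT tokenizer as a text pair.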