Li Jiru, Pan Dinghao, Yang Zhihao, Sun Yuanyuan, Lin Hongfei, Wang Jian
School of Computer Science and Technology, Dalian University of Technology, No. 2 Linggong Road, Ganjingzi District, Dalian 116024, China.
Database (Oxford). 2024 Dec 19;2024. doi: 10.1093/database/baae125.
Biomedical Relation Extraction (RE) is central to Biomedical Natural Language Processing and is crucial for various downstream applications. Existing RE challenges in the field of biology have primarily focused on intra-sentential analysis. However, with the rapid growth of the literature and the increasing complexity of relationships between biomedical entities, it is often necessary to consider multiple sentences to fully extract the relation between a pair of entities. Current methods often fail to capture the complex semantic structure of information in documents, which reduces extraction accuracy. Therefore, unlike traditional RE methods that rely on sentence-level analysis and heuristic rules, our method extracts entity relations from the titles and abstracts of biomedical literature and classifies whether each relation is a novel finding. We fine-tune a Pre-trained Language Model from the field of biology using a multitask training approach. Built on a broad spectrum of carefully designed tasks, our multitask method not only extracts higher-quality relations thanks to more effective supervision but also classifies more accurately whether entity pairs are novel findings. Moreover, applying a model ensemble method further improves performance. Extensive experiments demonstrate that our method achieves significant performance improvements, surpassing the existing baseline in F1 score on BioRED by 3.94% in RE and 3.27% in Triplet Novel Typing, confirming its effectiveness in handling complex biomedical literature RE tasks. Database URL: https://codalab.lisn.upsaclay.fr/competitions/13377#learn_the_details-dataset.
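The abstract does not say which ensemble method is used. As a purely illustrative sketch, the Python below assumes one common choice, soft voting, in which the class probabilities that several fine-tuned models assign to an entity pair are averaged and the highest-scoring label is kept. The function name `soft_vote`, the label names, the entity identifiers, and the probabilities are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple

# Illustrative label names only (an assumption, not the paper's label set).
LABELS = ["None", "Association", "Positive_Correlation", "Negative_Correlation"]

EntityPair = Tuple[str, str]


def soft_vote(
    per_model_probs: Sequence[Dict[EntityPair, List[float]]],
    labels: Sequence[str] = LABELS,
) -> Dict[EntityPair, str]:
    """Average class probabilities across models and keep the top label per pair.

    Assumes every model scored the same entity pairs with the same label order.
    """
    if not per_model_probs:
        raise ValueError("need predictions from at least one model")
    ensembled: Dict[EntityPair, str] = {}
    for pair in per_model_probs[0]:
        # One probability row per model for this entity pair.
        rows = [probs[pair] for probs in per_model_probs]
        # Column-wise mean gives the ensembled distribution.
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        best = max(range(len(mean)), key=mean.__getitem__)
        ensembled[pair] = labels[best]
    return ensembled


# Made-up predictions from three hypothetical models for one entity pair.
model_a = {("E1", "E2"): [0.10, 0.60, 0.20, 0.10]}
model_b = {("E1", "E2"): [0.20, 0.30, 0.40, 0.10]}
model_c = {("E1", "E2"): [0.10, 0.50, 0.30, 0.10]}

result = soft_vote([model_a, model_b, model_c])
print(result)
```

Here the second model alone would prefer a different label, but the averaged distribution favours the label chosen by the other two. Majority voting or weighted averaging would be equally plausible readings of "model ensemble", so this should be taken as one possible interpretation rather than the authors' procedure.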