BioREx: Improving biomedical relation extraction by leveraging heterogeneous datasets.

Author information

Lai Po-Ting, Wei Chih-Hsuan, Luo Ling, Chen Qingyu, Lu Zhiyong

Affiliations

National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA.

School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China.

Publication information

J Biomed Inform. 2023 Oct;146:104487. doi: 10.1016/j.jbi.2023.104487. Epub 2023 Sep 4.

Abstract

Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts in free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods have primarily trained machine learning models on individual RE datasets, such as those for protein-protein interaction and chemical-induced disease relations. Manual dataset annotation, however, is expensive and time-consuming because it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, raising the state of the art (SOTA) from 74.4% to 79.6% in F1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that, on average, BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability on two independent RE tasks not previously seen in the training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.
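The abstract does not spell out how the individual datasets are reconciled, so the following is only a toy sketch of one generic way to combine RE datasets whose label sets differ: map each dataset's labels into a shared schema and concatenate the results. The dataset names (`toy_ppi`, `toy_cdr`), the label mappings, the shared label names, and the helpers `harmonize` and `combine` are all hypothetical and are not taken from the BioREx code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationExample:
    """One relation instance in the shared (combined) schema."""
    text: str
    entity_a: str
    entity_b: str
    label: str
    source: str


# Hypothetical per-dataset mappings into a shared label schema.
# These are illustrative assumptions, not the mappings used by BioREx.
LABEL_MAPS = {
    "toy_ppi": {"interacts": "Association", "none": "None"},
    "toy_cdr": {"induces": "Positive_Correlation", "none": "None"},
}


def harmonize(source, records):
    """Map one dataset's records into the shared schema.

    Records whose label has no mapping are skipped rather than guessed.
    """
    mapping = LABEL_MAPS[source]
    examples = []
    for text, entity_a, entity_b, label in records:
        if label not in mapping:
            continue
        examples.append(
            RelationExample(text, entity_a, entity_b, mapping[label], source)
        )
    return examples


def combine(datasets):
    """Concatenate all harmonized datasets into one training set."""
    combined = []
    for source, records in datasets.items():
        combined.extend(harmonize(source, records))
    return combined


# Toy input: two datasets with different label vocabularies.
toy_datasets = {
    "toy_ppi": [
        ("Protein A binds protein B.", "Protein A", "protein B", "interacts"),
        ("Protein A and protein C were measured.", "Protein A", "protein C", "none"),
    ],
    "toy_cdr": [
        ("Drug X caused nausea.", "Drug X", "nausea", "induces"),
        ("Drug X was given with food.", "Drug X", "food", "unmapped_label"),
    ],
}

merged = combine(toy_datasets)
```

Keeping the `source` field on every example preserves provenance, so a model trained on the combined set can still be evaluated per original dataset.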
