Lai Po-Ting, Wei Chih-Hsuan, Luo Ling, Chen Qingyu, Lu Zhiyong
National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA.
School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
ArXiv. 2023 Jun 19:arXiv:2306.11189v1.
Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts in free text. RE is central to biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods have primarily trained machine learning models on individual RE datasets, such as those for protein-protein interactions and chemical-induced disease relations. Manual dataset annotation, however, is expensive and time-consuming because it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized, high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on this framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, raising the state of the art (SOTA) from 74.4% to 79.6% in F1-measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance on five different RE tasks. In addition, we show that, on average, BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability on two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.
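To make the idea of combining heterogeneous RE datasets concrete, the following is a minimal sketch under stated assumptions, not the BioREx implementation. It shows one plausible way to map dataset-specific relation labels onto a shared label set before merging. The dataset keys (`ppi_corpus`, `cdr_corpus`), the label names, and the helpers `harmonize` and `merge_datasets` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, replace

# Hypothetical label maps: each source dataset's native labels -> a shared label set.
# These names are illustrative assumptions, not the mapping used by BioREx.
LABEL_MAPS = {
    "ppi_corpus": {"interaction": "Association", "no_relation": "None"},
    "cdr_corpus": {"CID": "Positive_Correlation", "no_relation": "None"},
}


@dataclass(frozen=True)
class RelationExample:
    source: str  # which dataset the example came from
    text: str    # sentence or passage containing both entities
    head: str    # first entity mention
    tail: str    # second entity mention
    label: str   # relation label (native before harmonizing, shared after)


def harmonize(example, label_maps=LABEL_MAPS):
    """Return a copy of the example with its label rewritten to the shared set."""
    if example.source not in label_maps:
        raise KeyError(f"no label map for dataset {example.source!r}")
    mapping = label_maps[example.source]
    if example.label not in mapping:
        raise KeyError(f"unmapped label {example.label!r} in {example.source!r}")
    return replace(example, label=mapping[example.label])


def merge_datasets(datasets, label_maps=LABEL_MAPS):
    """Concatenate several datasets after harmonizing every example's label."""
    merged = []
    for examples in datasets:
        merged.extend(harmonize(e, label_maps) for e in examples)
    return merged


if __name__ == "__main__":
    ppi = [RelationExample("ppi_corpus", "A binds B.", "A", "B", "interaction")]
    cdr = [RelationExample("cdr_corpus", "X induces Y.", "X", "Y", "CID")]
    for ex in merge_datasets([ppi, cdr]):
        print(ex.source, ex.label)
```

Raising an error on unmapped labels, rather than silently assigning a default, is a deliberate choice in this sketch: a silent default would hide schema mismatches between datasets, which is exactly the heterogeneity the merge is meant to surface.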