BioREx: Improving Biomedical Relation Extraction by Leveraging Heterogeneous Datasets.

Authors

Lai Po-Ting, Wei Chih-Hsuan, Luo Ling, Chen Qingyu, Lu Zhiyong

Affiliations

National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA.

School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.

Publication information

ArXiv. 2023 Jun 19:arXiv:2306.11189v1.


PMID: 37502629
Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10370213/
Abstract

Biomedical relation extraction (RE) is the task of automatically identifying and characterizing relations between biomedical concepts from free text. RE is a central task in biomedical natural language processing (NLP) research and plays a critical role in many downstream applications, such as literature-based discovery and knowledge graph construction. State-of-the-art methods were used primarily to train machine learning models on individual RE datasets, such as protein-protein interaction and chemical-induced disease relation. Manual dataset annotation, however, is highly expensive and time-consuming, as it requires domain knowledge. Existing RE datasets are usually domain-specific or small, which limits the development of generalized and high-performing RE models. In this work, we present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset. Based on the framework and dataset, we report on BioREx, a data-centric approach for extracting relations. Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset, setting a new SOTA from 74.4% to 79.6% in F-1 measure on the recently released BioRED corpus. We further demonstrate that the combined dataset can improve performance for five different RE tasks. In addition, we show that on average BioREx compares favorably to current best-performing methods such as transfer learning and multi-task learning. Finally, we demonstrate BioREx's robustness and generalizability in two independent RE tasks not previously seen in training data: drug-drug N-ary combination and document-level gene-disease RE. The integrated dataset and optimized method have been packaged as a stand-alone tool available at https://github.com/ncbi/BioREx.
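
The idea of resolving heterogeneity across individual RE datasets before combining them can be illustrated with a minimal sketch. It is not the BioREx implementation: the dataset names, record fields, label names, and the `LABEL_MAP` table below are all hypothetical, chosen only to show how differently labeled corpora might be mapped onto one shared label set and merged.

```python
from dataclasses import dataclass

# Hypothetical per-dataset mappings onto one shared label set.
# These names are illustrative assumptions, not the BioREx schema.
LABEL_MAP = {
    "ppi": {"interacts": "Association", "none": "None"},
    "cdr": {"induces": "Positive_Correlation", "none": "None"},
}


@dataclass(frozen=True)
class Example:
    source: str   # dataset the example came from
    text: str     # sentence or passage containing both entities
    entity1: str
    entity2: str
    label: str    # label expressed in the shared label set


def harmonize(source, records):
    """Map one dataset's records onto the shared label set."""
    mapping = LABEL_MAP[source]
    examples = []
    for rec in records:
        unified = mapping.get(rec["label"])
        if unified is None:
            continue  # drop labels that have no agreed mapping
        examples.append(
            Example(source, rec["text"], rec["e1"], rec["e2"], unified)
        )
    return examples


def combine(datasets):
    """Harmonize every dataset, then concatenate into one training set."""
    merged = []
    for source, records in datasets.items():
        merged.extend(harmonize(source, records))
    return merged


# Toy records, invented for illustration only.
datasets = {
    "ppi": [
        {"text": "A binds B.", "e1": "A", "e2": "B", "label": "interacts"},
        {"text": "A and C.", "e1": "A", "e2": "C", "label": "unknown"},
    ],
    "cdr": [
        {"text": "X causes Y.", "e1": "X", "e2": "Y", "label": "induces"},
    ],
}
merged = combine(datasets)
```

In this sketch, records whose labels have no mapping are dropped rather than guessed, and each merged example keeps its `source` so that per-dataset evaluation remains possible after merging.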


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bcae/10370213/87116f8b3638/nihpp-2306.11189v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bcae/10370213/efb233c627a8/nihpp-2306.11189v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bcae/10370213/ca158bec4a9f/nihpp-2306.11189v1-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bcae/10370213/9201079c82c4/nihpp-2306.11189v1-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bcae/10370213/daf26bdfce3b/nihpp-2306.11189v1-f0005.jpg


