Surveying biomedical relation extraction: a critical examination of current datasets and the proposal of a new resource.

Affiliations

Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan.

National Institute of Cancer Research, National Health Research Institutes, Tainan, Taiwan.

Publication information

Brief Bioinform. 2024 Mar 27;25(3). doi: 10.1093/bib/bbae132.

DOI:10.1093/bib/bbae132

PMID:38609331

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11014787/

Abstract

Natural language processing (NLP) has become an essential technique in various fields, offering a wide range of possibilities for analyzing data and developing diverse NLP tasks. In the biomedical domain, understanding the complex relationships between compounds and proteins is critical, especially in the context of signal transduction and biochemical pathways. Among these relationships, protein-protein interactions (PPIs) are of particular interest, given their potential to trigger a variety of biological reactions. To improve the ability to predict PPI events, we propose the protein event detection dataset (PEDD), which comprises 6823 abstracts, 39 488 sentences and 182 937 gene pairs. Our PEDD dataset has been utilized in the AI CUP Biomedical Paper Analysis competition, where systems are challenged to predict 12 different relation types. In this paper, we review the state-of-the-art relation extraction research and provide an overview of the PEDD's compilation process. Furthermore, we present the results of the PPI extraction competition and evaluate several language models' performances on the PEDD. This paper's outcomes will provide a valuable roadmap for future studies on protein event detection in NLP. By addressing this critical challenge, we hope to enable breakthroughs in drug discovery and enhance our understanding of the molecular mechanisms underlying various diseases.
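As a rough sense of the dataset's density, the three counts reported in the abstract can be turned into per-unit averages. This is only a back-of-the-envelope sketch: the variable and function names are illustrative, and the averages are derived here from the reported totals, not figures stated by the authors.

```python
# Counts reported in the abstract for the PEDD dataset.
ABSTRACTS = 6823
SENTENCES = 39488
GENE_PAIRS = 182937


def per_unit(total: int, units: int) -> float:
    """Average number of items per unit, rounded to two decimals."""
    return round(total / units, 2)


# Derived averages (not reported in the source; computed from the totals above).
sentences_per_abstract = per_unit(SENTENCES, ABSTRACTS)
pairs_per_sentence = per_unit(GENE_PAIRS, SENTENCES)
pairs_per_abstract = per_unit(GENE_PAIRS, ABSTRACTS)

print(f"sentences per abstract: {sentences_per_abstract}")
print(f"gene pairs per sentence: {pairs_per_sentence}")
print(f"gene pairs per abstract: {pairs_per_abstract}")
```

These are corpus-wide means only; the abstract does not say how gene pairs are distributed across sentences or across the 12 relation types.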

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5159/11014787/34e84aceae6d/bbae132f1.jpg
