Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan.
National Institute of Cancer Research, National Health Research Institutes, Tainan, Taiwan.
Brief Bioinform. 2024 Mar 27;25(3). doi: 10.1093/bib/bbae132.
Natural language processing (NLP) has become an essential technique in many fields, offering a wide range of possibilities for analyzing data and supporting diverse downstream tasks. In the biomedical domain, understanding the complex relationships between compounds and proteins is critical, especially in the context of signal transduction and biochemical pathways. Among these relationships, protein-protein interactions (PPIs) are of particular interest because they can trigger a variety of biological reactions. To improve the prediction of PPI events, we propose the protein event detection dataset (PEDD), which comprises 6823 abstracts, 39 488 sentences and 182 937 gene pairs. The PEDD has been used in the AI CUP Biomedical Paper Analysis competition, in which systems are challenged to predict 12 different relation types. In this paper, we review state-of-the-art relation extraction research and provide an overview of how the PEDD was compiled. We also present the results of the PPI extraction competition and evaluate the performance of several language models on the PEDD. These results offer a roadmap for future studies on protein event detection in NLP. By addressing this challenge, we hope to enable breakthroughs in drug discovery and enhance understanding of the molecular mechanisms underlying various diseases.