Evaluation of GPT and BERT-based models on identifying protein-protein interactions in biomedical text
Author Information
Rehana Hasin, Çam Nur Bengisu, Basmaci Mert, Zheng Jie, Jemiyo Christianah, He Yongqun, Özgür Arzucan, Hur Junguk
Affiliations
Computer Science Graduate Program, University of North Dakota, Grand Forks, North Dakota, 58202, USA.
Department of Computer Engineering, Bogazici University, 34342 Istanbul, Turkey.
Publication Information
ArXiv. 2023 Dec 13:arXiv:2303.17728v2.
Detecting protein-protein interactions (PPIs) is crucial for understanding genetic mechanisms, disease pathogenesis, and drug design. However, with the fast-paced growth of the biomedical literature, there is a growing need for automated and accurate extraction of PPIs to facilitate scientific knowledge discovery. Pre-trained language models, such as generative pre-trained transformers (GPT) and bidirectional encoder representations from transformers (BERT), have shown promising results in natural language processing (NLP) tasks. We evaluated the PPI-identification performance of multiple GPT and BERT models using three manually curated gold-standard corpora: Learning Language in Logic (LLL), with 164 PPIs in 77 sentences; Human Protein Reference Database, with 163 PPIs in 145 sentences; and Interaction Extraction Performance Assessment, with 335 PPIs in 486 sentences. BERT-based models achieved the best overall performance, with BioBERT achieving the highest recall (91.95%) and F1-score (86.84%) and PubMedBERT achieving the highest precision (85.25%). Interestingly, despite not being explicitly trained on biomedical text, GPT-4 achieved commendable performance, comparable to that of the top-performing BERT models: a precision of 88.37%, a recall of 85.14%, and an F1-score of 86.49% on the LLL dataset. These results suggest that GPT models can effectively detect PPIs from text data, offering promising avenues for application in biomedical literature mining. Further research could explore how these models might be fine-tuned for even more specialized tasks within the biomedical domain.
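Below is a minimal, illustrative Python sketch, not the authors' pipeline, of the two steps the abstract summarizes: prompting a GPT model to list the protein-protein interactions stated in a sentence, and scoring predictions against gold-standard annotations with the precision, recall, and F1 metrics reported above. The prompt wording, the extract_ppis/normalize/score_corpus helpers, the example sentence, and the corpus format are assumptions made for illustration; only the OpenAI chat-completions call itself is a real API.

# Illustrative sketch only; prompt wording, helper names, and data format are assumptions.
from openai import OpenAI  # assumes the openai>=1.0 Python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def normalize(a: str, b: str) -> tuple:
    """Treat a PPI as an unordered, case-insensitive protein pair."""
    return tuple(sorted((a.lower(), b.lower())))


def extract_ppis(sentence: str, model: str = "gpt-4") -> set:
    """Ask a GPT model to list interacting protein pairs, one 'A | B' per line."""
    prompt = (
        "List every protein-protein interaction explicitly stated in the "
        "sentence below, one pair per line as 'ProteinA | ProteinB'. "
        "If there are none, answer NONE.\n\n"
        f"Sentence: {sentence}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    pairs = set()
    for line in resp.choices[0].message.content.splitlines():
        if "|" in line:
            a, b = (p.strip() for p in line.split("|", 1))
            pairs.add(normalize(a, b))
    return pairs


def score_corpus(gold_by_sentence: dict, pred_by_sentence: dict):
    """Micro-averaged precision, recall, and F1 over normalized protein pairs."""
    tp = fp = fn = 0
    for sid, gold_pairs in gold_by_sentence.items():
        gold = {normalize(*p) for p in gold_pairs}
        pred = {normalize(*p) for p in pred_by_sentence.get(sid, set())}
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical one-sentence example in the style of the LLL corpus.
    gold = {"s1": [("GerE", "cotB")]}
    pred = {"s1": extract_ppis("GerE stimulates cotB transcription.")}
    p, r, f = score_corpus(gold, pred)
    print(f"precision={p:.2%}  recall={r:.2%}  F1={f:.2%}")

Treating each PPI as an unordered, case-insensitive pair and micro-averaging over all sentences is one common way such corpus-level precision/recall/F1 figures are computed; the paper itself may use a different matching or evaluation protocol.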