GraphNABP: Identifying nucleic acid-binding proteins with protein graphs and protein language models.

Author information

Li Xiang, Wei Zhuoyu, Hu Yueran, Zhu Xiaolei

Affiliation

School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China.

Publication information

Int J Biol Macromol. 2024 Sep 12;280(Pt 1):135599. doi: 10.1016/j.ijbiomac.2024.135599.

DOI:10.1016/j.ijbiomac.2024.135599

PMID:39276905

Abstract

The computational identification of nucleic acid-binding proteins (NABPs) is of great significance for understanding the mechanisms of the biological activities in which they take part and for drug discovery. Although many sequence-based methods have been proposed to predict NABPs and have achieved promising performance, structural information is often overlooked. In addition, the power of popular protein language models (pLMs) has seldom been harnessed for predicting NABPs. In this study, we propose a novel framework called GraphNABP to predict NABPs by integrating sequence and predicted 3D structure information. Specifically, sequence embeddings and protein molecular graphs are first obtained from the ProtT5 protein language model and predicted 3D structures, respectively. Then, graph attention (GAT) and bidirectional long short-term memory (BiLSTM) neural networks are used to enhance the feature representations. Finally, a fully connected layer is used to predict NABPs. To the best of our knowledge, this is the first study to integrate AlphaFold and protein language models for the prediction of NABPs. Performance on multiple independent test sets indicates that GraphNABP outperforms other state-of-the-art methods. Our results demonstrate the effectiveness of pLM embeddings and structural information for NABP prediction. The code and data used in this study are available at https://github.com/lixiangli01/GraphNABP.
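
As a rough illustration of the protein-graph step mentioned in the abstract, the sketch below turns residue coordinates into a residue contact graph. The abstract does not say how GraphNABP builds its graphs, so the C-alpha-only representation, the 8 angstrom cutoff, the toy coordinates, and the function name `contact_graph` are all assumptions made for illustration rather than details from the paper or its repository.

```python
import math


def contact_graph(ca_coords, cutoff=8.0):
    """Return undirected edges (i, j), i < j, between residues whose
    C-alpha atoms lie within `cutoff` angstroms of each other.

    Both the C-alpha representation and the 8.0 default are illustrative
    assumptions, not values taken from GraphNABP.
    """
    edges = []
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                edges.append((i, j))
    return edges


# Toy coordinates (in angstroms) for four residues along a line; in practice
# these would be read from a predicted structure file.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
edges = contact_graph(coords)
print(edges)
```

In a pipeline of the kind the abstract outlines, an edge list like this would define the neighbourhoods over which a graph attention layer aggregates per-residue features such as pLM embeddings.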
