Li Xiang, Peng Wei, Zhu Xiaolei
School of Information and Artificial Intelligence, Anhui Agricultural University, 130 Changjiang Road, Shushan District, Hefei, Anhui 230036, China.
Brief Bioinform. 2025 Aug 31;26(5). doi: 10.1093/bib/bbaf457.
Protein-nucleic acid binding sites play a crucial role in biological processes such as gene expression, signal transduction, replication, and transcription. In recent years, advances in artificial intelligence have led to protein language models, graph neural networks, and transformer architectures being adopted in both structure-based and sequence-based predictive models. Structure-based methods exploit the spatial relationships between residues and have shown promising performance. However, they require 3D protein structures, which are difficult to obtain at the scale of known protein sequence space. To address this limitation, researchers have used predicted protein structures to guide binding site prediction. While this strategy has improved accuracy, it still depends on the quality of the structure predictions. Some studies have therefore returned to methods based solely on protein sequences, particularly those using protein language models, which have greatly improved prediction accuracy. This paper proposes a novel protein-nucleic acid binding site prediction framework, ATtention Maps and Graph convolutional neural networks to predict nucleic acid-protein Binding sites (ATMGBs). ATMGBs first fuses protein language model embeddings with physicochemical properties to obtain multiview information, then uses the attention map of a protein language model to approximate the relationships between residues, and finally applies graph convolutional networks to enhance the feature representations for the final prediction. ATMGBs was evaluated on several independent test sets. The results indicate that the proposed approach significantly improves sequence-based prediction performance and even achieves accuracy comparable to structure-based frameworks. The dataset and code used in this study are available at https://github.com/lixiangli01/ATMGBs.
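The three stages described in the abstract (feature fusion, an attention-derived residue graph, and graph convolution) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, which is in the linked repository. The feature dimensions, the attention threshold, the two-layer depth, the random untrained weights, and the function names `normalized_adjacency`, `gcn_layer`, and `predict_binding` are all assumptions made for illustration, and the inputs are random stand-ins for real language-model outputs.

```python
import numpy as np

def normalized_adjacency(attn, threshold=0.05):
    """Turn an (L, L) attention map into a symmetric, normalized adjacency.

    The threshold value is an illustrative assumption, not taken from the paper.
    """
    sym = (attn + attn.T) / 2.0                  # attention is directional; symmetrize it
    adj = np.where(sym >= threshold, sym, 0.0)   # drop weakly attended residue pairs
    np.fill_diagonal(adj, 1.0)                   # self-loops keep each residue's own features
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))  # degrees are > 0 because of the self-loops
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_hat, x, w):
    """One graph convolution: ReLU(A_hat @ X @ W)."""
    return np.maximum(a_hat @ x @ w, 0.0)

def predict_binding(embed, physchem, attn, rng):
    """Per-residue binding scores from fused features and an attention graph."""
    x = np.concatenate([embed, physchem], axis=1)  # multiview fusion by concatenation (assumed)
    a_hat = normalized_adjacency(attn)
    # Random, untrained weights: hypothetical placeholders for learned parameters.
    w1 = rng.normal(scale=0.1, size=(x.shape[1], 16))
    w2 = rng.normal(scale=0.1, size=(16, 1))
    h = gcn_layer(a_hat, x, w1)
    logits = (a_hat @ h @ w2).ravel()              # linear output layer over the graph
    return 1.0 / (1.0 + np.exp(-logits))           # sigmoid gives a score in (0, 1)

rng = np.random.default_rng(0)
L = 12                                             # toy sequence length
embed = rng.normal(size=(L, 32))                   # stand-in for language-model embeddings
physchem = rng.normal(size=(L, 7))                 # stand-in for physicochemical features
raw = rng.random((L, L))
attn = raw / raw.sum(axis=1, keepdims=True)        # rows sum to 1, like softmax attention
probs = predict_binding(embed, physchem, attn, rng)
print(probs.shape)
```

In a real pipeline the embeddings and attention map would come from a pretrained protein language model and the weights would be trained against labeled binding residues. The sketch only shows how an attention map can stand in for a structure-derived contact graph.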