Hybrid protein-ligand binding residue prediction with protein language models: does the structure matter?

Author information

Gamouh Hamza, Novotný Marian, Hoksza David

Affiliations

Faculty of Mathematics and Physics, Charles University, 118 00 Prague, Czech Republic.

Faculty of Science, Charles University, 128 00 Prague, Czech Republic.

Publication information

Bioinformatics. 2025 Aug 2;41(8). doi: 10.1093/bioinformatics/btaf431.

DOI:10.1093/bioinformatics/btaf431

PMID:40742755

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12377911/

Abstract

MOTIVATION

Predicting protein-ligand binding sites is crucial for studying protein interactions, with applications in biotechnology and drug discovery. Two distinct paradigms have emerged for this purpose: sequence-based methods, which leverage protein sequence information, and structure-based methods, which rely on the three-dimensional (3D) structure of the protein. Here, we analyze a hybrid approach that combines the strengths of both paradigms by integrating two recent deep learning architectures: protein language models (pLMs) from the sequence-based paradigm and Graph Neural Networks (GNNs) from the structure-based paradigm. Specifically, we construct a residue-level Graph Attention Network (GAT) model based on the protein's 3D structure that uses pre-trained pLM embeddings as node features. This integration enables us to study how the sequential information encoded in the protein sequence and the spatial relationships within the protein structure jointly affect model performance.
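To make the architecture described above concrete, the following is a minimal sketch of a residue-level graph with a single-head graph attention layer, written in plain Python. It is not the authors' implementation (their code is in the repository listed under Availability and Implementation). Several details are assumptions made only for illustration: residues are connected when their Cα atoms lie within a 6 Å cutoff, the node features are random vectors standing in for pLM embeddings, the weights are untrained, and the names `contact_neighbors` and `gat_layer` are hypothetical.

```python
import math
import random


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contact_neighbors(ca_coords, cutoff=6.0):
    """Neighbor lists for a residue graph: i and j are linked when their
    C-alpha distance is <= cutoff (assumed value). Self-loops are included."""
    n = len(ca_coords)
    return [[j for j in range(n) if math.dist(ca_coords[i], ca_coords[j]) <= cutoff]
            for i in range(n)]


def gat_layer(features, neighbors, W, a_src, a_dst, slope=0.2):
    """Single-head graph attention: project each node, score each edge with
    LeakyReLU(a_src.h_i + a_dst.h_j), softmax over neighbors, then aggregate."""
    h = [[dot(row, x) for row in W] for x in features]  # linear projection
    out = []
    for i, nbrs in enumerate(neighbors):
        raw = [dot(a_src, h[i]) + dot(a_dst, h[j]) for j in nbrs]
        scores = [s if s > 0 else slope * s for s in raw]  # LeakyReLU
        top = max(scores)                                  # stabilise softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * h[j][k] for w, j in zip(weights, nbrs))
                    for k in range(len(W))])
    return out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Toy protein: four residues on a line, 4 Angstrom apart (hypothetical coordinates).
rng = random.Random(0)
ca_coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
embed_dim, hidden_dim = 8, 4

# Random vectors stand in for per-residue pLM embeddings (assumption).
embeddings = [[rng.gauss(0, 1) for _ in range(embed_dim)] for _ in ca_coords]

# Untrained parameters, for shape illustration only.
W = [[rng.gauss(0, 0.1) for _ in range(embed_dim)] for _ in range(hidden_dim)]
a_src = [rng.gauss(0, 0.1) for _ in range(hidden_dim)]
a_dst = [rng.gauss(0, 0.1) for _ in range(hidden_dim)]
w_out = [rng.gauss(0, 0.1) for _ in range(hidden_dim)]

neighbors = contact_neighbors(ca_coords)
hidden = gat_layer(embeddings, neighbors, W, a_src, a_dst)
probs = [sigmoid(dot(w_out, h)) for h in hidden]  # per-residue binding score
```

In this sketch the sequence-only baseline corresponds to scoring each embedding on its own, while the graph layer lets each residue also attend to its spatial neighbors. A practical model would use a trained multi-head GAT from a deep learning library rather than these hand-written loops.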

RESULTS

Using a benchmark dataset covering a range of ligands and ligand types, we show that structure information consistently improves the predictive power of the baselines in absolute terms. Nevertheless, as more complex pLMs are used to represent node features, the relative impact of the structure information captured by the GNN architecture diminishes. These observations suggest that although using the experimental protein structure almost always improves the accuracy of binding site prediction, complex pLMs already contain structural information that leads to good predictive performance even without the 3D structure.

AVAILABILITY AND IMPLEMENTATION

The datasets generated and/or analyzed during the current study, as well as the pretrained models, are available on Zenodo at https://zenodo.org/records/15184302. The source code used to generate the results of the current study is available on GitHub at https://github.com/hamzagamouh/pt-lm-gnn and on Zenodo at https://zenodo.org/records/15192327.
