School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China.
School of Life Sciences, Zhengzhou University, Zhengzhou 450001, China.
Biomolecules. 2024 Sep 27;14(10):1220. doi: 10.3390/biom14101220.
Proteins perform different biological functions by binding to various molecules, and these interactions are mediated by a few key residues. Accurate prediction of such protein binding residues (PBRs) is therefore crucial for understanding cellular processes and for designing new drugs. Many computational approaches have been proposed to identify PBRs from sequence-based features. However, these approaches face two main challenges: (1) they only concatenate residue feature vectors using a simple sliding window strategy, and (2) it is difficult to find a single sliding window size that is suitable for learning embeddings across different types of PBRs. In this study, we propose a novel framework that can be applied to multiple types of PBR prediction tasks through a Multi-scale Sequence-based Feature Fusion (PMSFF) strategy. First, PMSFF employs a pre-trained language model, ProtT5, to encode the amino acid residues in protein sequences. Then, it generates multi-scale residue embeddings by applying multi-size windows to capture effective neighboring residues and multi-size kernels to learn information across different scales. Additionally, the proposed model treats protein sequences as sentences and employs a bidirectional GRU to learn global context. We also collect benchmark datasets encompassing various PBR types and evaluate PMSFF on these datasets. Compared with state-of-the-art methods, PMSFF demonstrates superior performance on most PBR prediction tasks.
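The multi-size window idea can be illustrated with a minimal sketch. It is not the PMSFF implementation: it assumes per-residue embeddings are already available as plain lists of floats (standing in for ProtT5 output), and it uses a fixed mean over each window as a simple stand-in for the learned multi-size kernels described above. The names `window_mean` and `multi_scale_features` and the window sizes are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def window_mean(embeddings: Sequence[Vector], center: int, size: int) -> Vector:
    """Average the embeddings in a window of `size` residues centred on `center`.

    Positions that fall outside the sequence count as zero vectors
    (zero padding), so the sum is always divided by `size`.
    """
    half = size // 2
    dim = len(embeddings[0])
    total = [0.0] * dim
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(embeddings):
            for d in range(dim):
                total[d] += embeddings[pos][d]
    return [value / size for value in total]


def multi_scale_features(
    embeddings: Sequence[Vector],
    window_sizes: Sequence[int] = (1, 3, 5),  # assumed sizes, for illustration only
) -> List[Vector]:
    """Concatenate one window summary per window size for every residue."""
    if not embeddings:
        return []
    for size in window_sizes:
        if size < 1 or size % 2 == 0:
            raise ValueError("window sizes must be positive odd integers")
    features: List[Vector] = []
    for i in range(len(embeddings)):
        row: Vector = []
        for size in window_sizes:
            row.extend(window_mean(embeddings, i, size))
        features.append(row)
    return features


# Toy input: 4 residues with 2-dimensional embeddings (real embeddings are much larger).
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
toy_features = multi_scale_features(toy_embeddings, window_sizes=(1, 3))
```

Each residue ends up with one feature vector whose length is the embedding dimension times the number of window sizes, so a downstream model sees several neighborhood scales at once instead of depending on a single window size. In a setup like the one the abstract describes, such per-residue vectors would then be passed to a sequence model (a bidirectional GRU in PMSFF) to add global context.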