Faculty of Information Technology, VNU University of Engineering and Technology, 144 Xuan Thuy, Hanoi, 10000, Vietnam.
Faculty of Biology, VNU University of Science, 334 Nguyen Trai, Hanoi, 10000, Vietnam.
BMC Bioinformatics. 2024 Mar 10;25(1):106. doi: 10.1186/s12859-024-05725-6.
Predicting protein-protein interactions (PPIs) from sequence data is a key challenge in computational biology. Although various computational methods have been proposed, sequence embeddings from protein language models, which encode structural, evolutionary, and functional information, have not been fully exploited. There is also a clear need for a neural network that can efficiently extract these multifaceted representations.
To address this gap, we propose xCAPT5, a novel hybrid classifier that uses the T5-XL-UniRef50 protein large language model to generate rich amino acid embeddings from protein sequences. The core of xCAPT5 is a multi-kernel deep convolutional siamese neural network, which captures interaction features at both micro and macro levels and is integrated with the XGBoost algorithm to improve PPI classification performance. By concatenating max-pooling and average-pooling features depth-wise, xCAPT5 learns crucial features at low computational cost.
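The following is a minimal, dependency-free sketch of the ideas named above: several convolution kernel sizes applied to per-residue embeddings, max and average pooling concatenated along the channel (depth) axis, and a siamese arrangement in which both proteins pass through the same weights. It is not the xCAPT5 implementation. The function names, toy dimensions, random weights, final global max pooling, and element-wise product used to merge the two branches are all illustrative assumptions, and random vectors stand in for the T5-XL-UniRef50 embeddings.

```python
import random

def make_kernel(rng, k, d_in, d_out):
    """Random weights of shape [k][d_in][d_out] (stand-in for trained filters)."""
    return [[[rng.uniform(-0.1, 0.1) for _ in range(d_out)]
             for _ in range(d_in)] for _ in range(k)]

def conv1d_relu(x, w):
    """Valid 1D convolution along the residue axis, followed by ReLU."""
    k, d_in, d_out = len(w), len(w[0]), len(w[0][0])
    out = []
    for start in range(len(x) - k + 1):
        row = []
        for o in range(d_out):
            s = sum(x[start + i][c] * w[i][c][o]
                    for i in range(k) for c in range(d_in))
            row.append(max(0.0, s))
        out.append(row)
    return out

def max_avg_pool_concat(x, window):
    """Pool along the residue axis; concatenate max and mean channel-wise."""
    out = []
    for start in range(0, len(x) - window + 1, window):
        cols = list(zip(*x[start:start + window]))
        out.append([max(c) for c in cols] + [sum(c) / window for c in cols])
    return out

def encode(x, kernels, window=2):
    """One siamese branch: multi-kernel conv, pooling, fixed-size vector."""
    feats = []
    for w in kernels:
        pooled = max_avg_pool_concat(conv1d_relu(x, w), window)
        # Assumption: global max over positions gives a length-independent vector.
        feats.extend(max(col) for col in zip(*pooled))
    return feats

def pair_features(xa, xb, kernels):
    """Shared weights for both proteins; merge by element-wise product (assumed)."""
    ea, eb = encode(xa, kernels), encode(xb, kernels)
    return [a * b for a, b in zip(ea, eb)]

rng = random.Random(0)
D, C = 8, 4  # toy embedding width and channels per kernel (hypothetical values)
kernels = [make_kernel(rng, k, D, C) for k in (2, 3, 5)]
protein_a = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(12)]
protein_b = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(20)]
features = pair_features(protein_a, protein_b, kernels)
print(len(features))  # 3 kernel sizes * 2 pooling types * 4 channels = 24
```

In a hybrid setup of the kind described, a fixed-length pair vector like `features` would be what a gradient-boosted tree classifier such as XGBoost consumes. That step is omitted here to keep the sketch free of third-party dependencies.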
This study represents one of the initial efforts to extract informative amino acid embeddings from a large protein language model using a deep and wide convolutional network. Experimental results show that xCAPT5 outperforms recent state-of-the-art methods in binary PPI prediction, excelling in cross-validation on several benchmark datasets and demonstrating robust generalization across intra-species, cross-species, inter-species, and stringent similarity contexts.