Salmanian Sara, Pezeshk Hamid, Sadeghi Mehdi
Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran.
School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, Iran.
BMC Bioinformatics. 2020 Dec 17;21(1):584. doi: 10.1186/s12859-020-03930-7.
Predicting physical interactions between proteins is one of the greatest challenges in computational biology. Protein interactions are highly varied, and there are huge numbers of protein sequences and synthetic peptides whose interacting counterparts are unknown. Most co-evolutionary methods detect a mixture of physical interactions and functional associations, and only a handful of approaches specifically infer physical interactions. Hybrid co-evolutionary methods exploit inter-protein residue coevolution to identify specific physically interacting proteins. In this study, we introduce a hybrid co-evolution-based approach that predicts physical interactions between pairs of protein families, starting from protein sequences only.
In the present analysis, pairs of multiple sequence alignments are constructed for each dimer, and the covariation between residues in those pairs is calculated by CCMpred (Contacts from Correlated Mutations predicted) and three mutual information based approaches for ten accessible surface area threshold groups. All residue couplings between the two proteins of each dimer are then summarized as a single Frobenius norm value. The norms of the residue contact matrices of all dimers at the different accessible surface area thresholds are fed into a support vector machine as single-feature or multiple-feature models. Training the classifiers on single features shows no apparent difference in accuracy among the methods across accessible surface area thresholds. Nevertheless, mutual information product tends to perform somewhat better overall, and context likelihood of relatedness somewhat worse, than the other two methods across accessible surface area cut-offs. The results also demonstrate that training the support vector machine with multiple norm features from several accessible surface area thresholds considerably improves prediction performance. In this setting, CCMpred achieves a somewhat better overall performance than the mutual information based approaches. The best accuracy, sensitivity, specificity, precision and negative predictive value for that method are 0.98, 1, 0.962, 0.96, and 0.962, respectively.
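As a rough illustration of the feature construction described above, the following Python sketch collapses an inter-protein coupling matrix into one Frobenius norm per accessible surface area (ASA) threshold and computes the five reported evaluation measures from confusion-matrix counts. It is not the authors' pipeline: the function names, the toy numbers, and in particular the filtering rule (keep residues whose ASA is at or above the threshold) are assumptions made for the example. In practice the coupling scores would come from CCMpred or a mutual information based method, and the resulting feature vectors would be passed to a support vector machine, which is not shown here.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a 2-D list of numbers."""
    return math.sqrt(sum(value * value for row in matrix for value in row))


def norm_features(couplings, asa_a, asa_b, thresholds):
    """Return one Frobenius norm per ASA threshold.

    couplings[i][j] is a coupling score between residue i of protein A and
    residue j of protein B. ASSUMPTION: a residue is kept when its ASA value
    is >= the threshold; the abstract does not specify the actual rule.
    """
    features = []
    for threshold in thresholds:
        rows = [i for i, asa in enumerate(asa_a) if asa >= threshold]
        cols = [j for j, asa in enumerate(asa_b) if asa >= threshold]
        block = [[couplings[i][j] for j in cols] for i in rows]
        features.append(frobenius_norm(block))
    return features


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, precision and NPV from counts."""
    def ratio(numerator, denominator):
        # Report NaN rather than raising when a denominator is empty.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Toy data (hypothetical): a 2 x 2 coupling matrix and per-residue ASA values.
toy_couplings = [[3.0, 0.0], [0.0, 4.0]]
toy_asa_a = [0.1, 0.5]
toy_asa_b = [0.2, 0.6]
toy_features = norm_features(toy_couplings, toy_asa_a, toy_asa_b, [0.0, 0.3])
toy_metrics = binary_metrics(tp=8, fp=2, tn=9, fn=1)
```

Each dimer then contributes a short feature vector (one norm per threshold), which corresponds to the multiple-feature models mentioned above; using a single entry of that vector corresponds to a single-feature model.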
In this paper, by feeding the norm values of protein dimers at different accessible surface area thresholds into support vector machines, we demonstrate that even a small number of proteins in paired multiple sequence alignments can suffice to accurately discriminate between positive and negative dimers.