School of Computer Science and Technology and Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China; Shanghai Key Laboratory of Intelligent Information Processing, Shanghai 200433, China; Gordon Life Science Institute, Belmont, MA 02478, USA; School of Computer, Shenyang Aerospace University, Shenyang, Liaoning, China; School of Computer Science, Fudan University, Shanghai 200433, China; and Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia.
Bioinformatics. 2014 Feb 15;30(4):472-9. doi: 10.1093/bioinformatics/btt709. Epub 2013 Dec 5.
Protein remote homology detection has attracted a great deal of interest because of its importance in both basic research (such as molecular evolution and protein attribute prediction) and practical applications (such as the timely modeling of the 3D structures of proteins targeted for drug development). Profile-based approaches are promising and hold high potential in this regard. To further improve protein remote homology detection, a key step is finding an optimal way to extract evolutionary information into the profiles.
Here, we propose a novel approach, called profile-based protein representation, that extracts evolutionary information via frequency profiles, which can be calculated from the multiple sequence alignments generated by PSI-BLAST. Three top-performing sequence-based kernels (SVM-Ngram, SVM-pairwise and SVM-LA) were combined with the profile-based protein representation. Various tests were conducted on a SCOP benchmark dataset that contains 54 families and 23 superfamilies. The results showed that the new approach is promising and clearly improves the performance of the three kernels. Furthermore, the approach can provide useful insights for studying the features of proteins in various families. It can also be easily combined with other existing sequence-based methods to improve their performance.
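As a rough illustration of the idea described above, the following Python sketch builds a frequency profile from a toy multiple sequence alignment and derives a profile-based sequence from it. The abstract does not spell out the conversion rule, so the rule used here (replace each residue with the most frequent amino acid in its alignment column) is an assumption, as are the function names `frequency_profile` and `profile_based_protein` and the toy alignment. Real frequency profiles derived from PSI-BLAST output may also involve sequence weighting or pseudocounts; this sketch uses raw column counts only.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_profile(alignment):
    """Return one dict per alignment column: amino acid -> relative frequency.

    `alignment` is a list of equal-length aligned sequences (hypothetical
    input format). Gaps and non-standard symbols are ignored.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("aligned sequences must have equal length")

    profile = []
    for column in zip(*alignment):
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values())
        if total:
            profile.append({aa: n / total for aa, n in counts.items()})
        else:
            profile.append({})
    return profile


def profile_based_protein(query, profile):
    """Replace each query residue with the most frequent residue in its column.

    Assumed rule, not taken from the abstract. Ties keep the query residue
    when it is among the most frequent; otherwise the alphabetically first
    candidate is used. Columns with no counts keep the query residue.
    """
    converted = []
    for residue, column in zip(query, profile):
        if not column:
            converted.append(residue)
            continue
        best = max(column.values())
        candidates = sorted(aa for aa, f in column.items() if f == best)
        converted.append(residue if residue in candidates else candidates[0])
    return "".join(converted)


# Toy query-anchored alignment (first row is the query); purely illustrative.
toy_alignment = ["MKVLA", "MRVLA", "MRILG", "MR-LA"]
toy_profile = frequency_profile(toy_alignment)
print(profile_based_protein(toy_alignment[0], toy_profile))
```

The converted sequence is still an ordinary string over the amino acid alphabet, which is why, under this assumption, it could be passed unchanged to any sequence-based kernel such as SVM-Ngram, SVM-pairwise or SVM-LA.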
For users' convenience, the source code for generating the profile-based proteins and for multiple kernel learning is available at http://bioinformatics.hitsz.edu.cn/main/~binliu/remote/