Melvin Iain, Ie Eugene, Kuang Rui, Weston Jason, Noble William Stafford, Leslie Christina
NEC Laboratories of America, Princeton, NJ, USA.
BMC Bioinformatics. 2007 May 22;8(Suppl 4):S2. doi: 10.1186/1471-2105-8-S4-S2.
Predicting a protein's structural class from its amino acid sequence is a fundamental problem in computational biology. Much recent work has focused on developing new representations for protein sequences, called string kernels, for use with support vector machine (SVM) classifiers. However, while some of these approaches exhibit state-of-the-art performance on the binary protein classification problem, i.e. discriminating between a particular protein class and all other classes, few of these studies have addressed the real problem of multi-class superfamily or fold recognition. Moreover, few software tools and systems for SVM-based protein classification are available to the bioinformatics community.
We present a new multi-class SVM-based protein fold and superfamily recognition system and web server called SVM-Fold, which can be found at http://svm-fold.c2b2.columbia.edu. Our system uses an efficient implementation of a state-of-the-art string kernel for sequence profiles, called the profile kernel, where the underlying feature representation is a histogram of inexact matching k-mer frequencies. We also employ a novel machine learning approach to solve the difficult multi-class problem of classifying a sequence of amino acids into one of many known protein structural classes. Binary one-vs-the-rest SVM classifiers that are trained to recognize individual structural classes yield prediction scores that are not comparable across classifiers, so standard "one-vs-all" classification, which simply picks the class with the highest score, fails to perform well. Moreover, SVMs for classes at different levels of the protein structural hierarchy may make useful predictions, but one-vs-all does not try to combine these multiple predictions. To deal with these problems, our method learns relative weights between one-vs-the-rest classifiers and encodes information about the protein structural hierarchy for multi-class prediction. In large-scale benchmark results based on the SCOP database, our code weighting approach significantly improves on the standard one-vs-all method for both superfamily and fold prediction in the remote homology setting, as well as on the fold recognition problem. Moreover, our code weight learning algorithm strongly outperforms nearest-neighbor methods based on PSI-BLAST in terms of prediction accuracy on every structure classification problem we consider.
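The code weighting idea described above can be illustrated with a small sketch. This is not the SVM-Fold implementation: the abstract does not give the update rule, so the perceptron-style update, the function names `predict` and `learn_code_weights`, the learning rate, and the toy scores are all assumptions made for illustration. Each class is given a 0/1 code vector marking the one-vs-the-rest classifiers that apply to it (its own superfamily classifier and the classifier for its parent fold), and one weight per classifier rescales the raw SVM scores before they are matched against the codes.

```python
import numpy as np

def predict(scores, codes, weights):
    """Pick the class whose code vector best matches the weighted scores.

    scores  : (n_classifiers,) one-vs-the-rest outputs for one sequence
    codes   : (n_classes, n_classifiers) 0/1 matrix; codes[y, j] = 1 when
              classifier j covers class y or one of its ancestors
    weights : (n_classifiers,) relative weight of each classifier
    """
    return int(np.argmax(codes @ (weights * scores)))

def learn_code_weights(score_matrix, labels, codes, epochs=20, lr=0.1):
    """Perceptron-style weight learning (an assumed update rule)."""
    weights = np.ones(codes.shape[1])
    for _ in range(epochs):
        for scores, y in zip(score_matrix, labels):
            y_hat = predict(scores, codes, weights)
            if y_hat != y:
                # Raise weights on classifiers that support the true class,
                # lower them on classifiers that support the wrong prediction.
                weights += lr * (codes[y] - codes[y_hat]) * scores
    return weights

# Toy hierarchy (hypothetical): superfamilies A1 and A2 in fold A, B1 in fold B.
# Classifier columns: [A1, A2, B1, fold A, fold B].
codes = np.array([
    [1, 0, 0, 1, 0],  # class 0: A1
    [0, 1, 0, 1, 0],  # class 1: A2
    [0, 0, 1, 0, 1],  # class 2: B1
])

# Made-up scores in which the B1 classifier is on an inflated scale.
score_matrix = np.array([
    [0.5, -0.3, 1.5, 0.4, -0.2],   # true class A1
    [-0.2, 0.6, 1.4, 0.5, -0.1],   # true class A2
    [-0.5, -0.4, 3.0, -0.3, 0.5],  # true class B1
])
labels = [0, 1, 2]

unit = np.ones(codes.shape[1])
baseline = [predict(s, codes, unit) for s in score_matrix]
weights = learn_code_weights(score_matrix, labels, codes)
learned = [predict(s, codes, weights) for s in score_matrix]
print("unweighted:", baseline)
print("weighted:  ", learned)
```

On this toy data the unweighted rule assigns every sequence to B1 because that classifier's scores are on a larger scale, while the learned weights shrink its influence enough for all three sequences to be classified correctly. In practice the weights would be fit on held-out SVM scores rather than on the data used to train the SVMs.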
By combining state-of-the-art SVM kernel methods with a novel multi-class algorithm, the SVM-Fold system delivers efficient and accurate protein fold and superfamily recognition.