Chen Ke, Kurgan Lukasz A, Ruan Jishou
Department of Electrical and Computer Engineering, ECERF, University of Alberta, Edmonton, Alberta, Canada.
J Comput Chem. 2008 Jul 30;29(10):1596-604. doi: 10.1002/jcc.20918.
Knowledge of structural classes is useful for understanding folding patterns in proteins. Although existing structural class prediction methods have applied virtually all state-of-the-art classifiers, many of them use a relatively simple protein sequence representation that often includes amino acid (AA) composition. To address this limitation, we propose a novel sequence representation that incorporates evolutionary information encoded using PSI-BLAST profile-based collocation of AA pairs. We used six benchmark datasets and five representative classifiers to quantify and compare the quality of structural class prediction with the proposed representation. The best classifier, a support vector machine, achieved 61-96% accuracy on the six datasets. These predictions were compared with a wide range of recently proposed methods for prediction of structural classes. The comparison shows the superiority of the proposed representation, which results in error rate reductions that range between 14% and 26% when compared with the predictions of the best-performing, previously published classifiers on the considered datasets. The study also shows that, for the three benchmark datasets that include sequences characterized by low identity (i.e., 25%, 30%, and 40%), the prediction accuracies are 20-35% lower than for the other three datasets, which include sequences with a higher degree of similarity. In conclusion, the proposed representation is shown to substantially improve the accuracy of structural class prediction. A web server that implements the presented prediction method is freely available at http://biomine.ece.ualberta.ca/Structural_Class/SCEC.html.
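The abstract does not give the formula behind the "profile-based collocation of AA pairs", so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the PSI-BLAST profile has already been parsed and rescaled into rows of 20 non-negative weights (one row per sequence position, columns in the usual PSI-BLAST order), and it averages the products of weights at positions separated by a fixed gap to obtain a 400-dimensional pair feature vector. The names `collocation_features`, `one_hot_profile`, and the `gap` parameter are hypothetical.

```python
from typing import List, Sequence

# Column order used in PSI-BLAST ASCII profile output (20 standard residues).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def collocation_features(profile: Sequence[Sequence[float]], gap: int = 0) -> List[float]:
    """Average pairwise co-occurrence weights from a per-position profile.

    ASSUMPTION: this averaging scheme is illustrative only; the abstract
    does not specify the exact encoding used in the paper.

    profile: one row per sequence position, 20 weights per row.
    gap:     number of positions skipped between the two residues of a pair
             (0 means adjacent positions).
    Returns a list of 400 values; index i * 20 + j holds the pair
    (AMINO_ACIDS[i], AMINO_ACIDS[j]).
    """
    n = len(AMINO_ACIDS)
    if any(len(row) != n for row in profile):
        raise ValueError("every profile row must contain 20 weights")
    step = gap + 1
    pairs = len(profile) - step
    if pairs <= 0:
        raise ValueError("profile is too short for the requested gap")

    features = [0.0] * (n * n)
    for k in range(pairs):
        left, right = profile[k], profile[k + step]
        for i in range(n):
            if left[i] == 0.0:
                continue  # zero weight contributes nothing to any pair
            base = i * n
            for j in range(n):
                features[base + j] += left[i] * right[j]
    # Normalise by the number of position pairs so sequence length cancels out.
    return [value / pairs for value in features]


def one_hot_profile(sequence: str) -> List[List[float]]:
    """Degenerate profile with no evolutionary information (for illustration)."""
    return [[1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS] for residue in sequence]


if __name__ == "__main__":
    feats = collocation_features(one_hot_profile("AAR"))
    print(len(feats), feats[0], feats[1])
```

With a one-hot profile the features reduce to plain dipeptide frequencies; a real profile spreads each position's weight over several residues, which is how evolutionary information would enter such a representation. The resulting vector could then be passed to any standard classifier, such as a support vector machine.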