YASSPP: better kernels and coding schemes lead to improvements in protein secondary structure prediction.

Author information

George Karypis

Affiliation

Department of Computer Science & Engineering, University of Minnesota, Army HPC Research Center, Minneapolis, Minnesota 55455, USA.

Publication information

Proteins. 2006 Aug 15;64(3):575-86. doi: 10.1002/prot.21036.

PMID:16763996

Abstract

The accurate prediction of a protein's secondary structure plays an increasingly critical role in predicting its function and tertiary structure, as it is utilized by many of the current state-of-the-art methods for remote homology, fold recognition, and ab initio structure prediction. We developed a new secondary structure prediction algorithm called YASSPP, which uses a pair of cascaded models constructed from two sets of binary SVM-based models. YASSPP uses an input coding scheme that combines both position-specific and nonposition-specific information, utilizes a kernel function designed to capture the sequence conservation signals around the local window of each residue, and constructs a second-level model by incorporating both the three-state predictions produced by the first-level model and information about the original sequence. Experiments on three standard datasets (RS126, CB513, and EVA common subset 4) show that YASSPP produces higher Q3 and SOV scores than existing widely used schemes such as PSIPRED, SSPro 4.0, and SAM-T99sec, as well as previously developed SVM-based schemes. On the EVA dataset it achieves Q3 and SOV scores of 79.34% and 78.65%, respectively, which are considerably higher than the best previously reported scores of 77.64% and 76.05%.
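
For readers unfamiliar with the Q3 measure quoted above, the following minimal sketch shows how it is commonly computed: the percentage of residues whose predicted three-state label (helix H, strand E, coil C) matches the observed label. The function name `q3_score` and the example strings are hypothetical and chosen for illustration; this is not the evaluation code used in the paper, and it does not cover the SOV score.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Return the percentage of residues with a matching three-state label.

    Both arguments are strings over the alphabet H (helix), E (strand),
    and C (coil), one character per residue. The name and interface are
    assumptions made for this illustration.
    """
    if len(predicted) != len(observed):
        raise ValueError("sequences must have equal length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    # Count positions where the predicted and observed states agree.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Toy example: one of ten residues is mispredicted (H predicted as C).
observed = "CHHHHEEECC"
predicted = "CHHHCEEECC"
print(q3_score(predicted, observed))  # 90.0
```

In practice the score is usually reported over all residues of a benchmark set, so a per-dataset figure would pool the matches and lengths across proteins.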
