Division of Mathematical Sciences, School of Physical & Mathematical Sciences, Nanyang Technological University, Singapore 637371.
J Theor Biol. 2009 Apr 21;257(4):618-26. doi: 10.1016/j.jtbi.2008.12.027. Epub 2009 Jan 8.
In this paper, we predict protein structural classes (alpha, beta, alpha+beta, or alpha/beta) for low-homology data sets. We use two widely used data sets, 1189 (containing 1092 proteins) and 25PDB (containing 1673 proteins), with sequence homology of 40% and 25%, respectively. We propose decomposing the chaos game representation of a protein into two kinds of time series. A nonlinear analysis technique, recurrence quantification analysis (RQA), is then applied to these time series. For a given protein sequence, RQA yields a total of 16 characteristic parameters, which serve as the feature representation of the sequence. Based on this feature representation, the structural class of each protein is predicted with Fisher's linear discriminant algorithm. The jackknife test is used to evaluate our method and compare it with other existing methods. With the step-by-step procedure, the overall accuracies are 65.8% and 64.2% for the 1189 and 25PDB data sets, respectively. With the widely used one-against-others procedure, we compare our method with five other existing methods; in particular, the overall accuracies of our method are 6.3% and 4.1% higher for the two data sets, respectively. Furthermore, our method uses only 16 parameters, fewer than the other methods use. This suggests that the current method may play a complementary role to the existing methods and is promising for predicting protein structural classes.
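To make the RQA step concrete, here is a minimal sketch of two standard RQA measures, recurrence rate and determinism, computed from a one-dimensional time series. The abstract does not say which 16 parameters are used, how the time series are embedded, or how the recurrence threshold is chosen, so everything here is an illustrative assumption rather than the authors' procedure: the function names, the threshold `eps`, the minimum line length `lmin`, and the simplification of comparing raw values without time-delay embedding.

```python
from typing import List


def recurrence_matrix(series: List[float], eps: float) -> List[List[int]]:
    """R[i][j] = 1 when |x_i - x_j| <= eps (assumption: no time-delay embedding)."""
    n = len(series)
    return [[1 if abs(series[i] - series[j]) <= eps else 0 for j in range(n)]
            for i in range(n)]


def recurrence_rate(R: List[List[int]]) -> float:
    """Fraction of recurrent points, excluding the main diagonal."""
    n = len(R)
    if n < 2:
        return 0.0
    total = sum(R[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def determinism(R: List[List[int]], lmin: int = 2) -> float:
    """Fraction of recurrent points that lie on diagonal lines of length >= lmin.

    Only the upper triangle is scanned, since R is symmetric.
    """
    n = len(R)
    on_lines = 0  # recurrent points belonging to sufficiently long diagonal lines
    total = 0     # all recurrent points in the upper triangle
    for k in range(1, n):
        run = 0
        for i in range(n - k):
            if R[i][i + k]:
                run += 1
                total += 1
            else:
                if run >= lmin:
                    on_lines += run
                run = 0
        if run >= lmin:  # close a line that reaches the end of the diagonal
            on_lines += run
    return on_lines / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical toy series standing in for one of the two CGR-derived series.
    toy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
    R = recurrence_matrix(toy, eps=0.1)
    print(recurrence_rate(R), determinism(R))
```

In a pipeline like the one described, a fixed set of such measures computed on each of the two time series would be concatenated into one feature vector per protein before classification.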