Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, 21 Nanyang Link, Singapore.
BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S9. doi: 10.1186/1471-2105-11-S1-S9.
Prediction of protein structural classes (alpha, beta, alpha + beta and alpha/beta) from amino acid sequences is of great importance, as it is beneficial to study protein function, regulation and interactions. Many methods have been developed for high-homology protein sequences, and the prediction accuracies can achieve up to 90%. However, for low-homology sequences whose average pairwise sequence identity lies between 20% and 40%, they perform relatively poorly, yielding the prediction accuracy often below 60%.
We propose a new method to predict protein structural classes on the basis of features extracted from the predicted secondary structures of proteins rather than directly from their amino acid sequences. It first uses PSIPRED to predict the secondary structure for each protein sequence. Then, the chaos game representation is employed to represent the predicted secondary structure as two time series, from which we generate a comprehensive set of 24 features using recurrence quantification analysis, K-string based information entropy and segment-based analysis. The resulting feature vectors are finally fed into a simple yet powerful Fisher's discriminant algorithm for the prediction of protein structural classes. We tested the proposed method on three benchmark datasets in low homology and achieved the overall prediction accuracies of 82.9%, 83.1% and 81.3%, respectively. Comparisons with ten existing methods showed that our method consistently performs better for all the tested datasets and the overall accuracy improvements range from 2.3% to 27.5%. A web server that implements the proposed method is freely available at http://www1.spms.ntu.edu.sg/~chenxin/RKS_PPSC/.
The high prediction accuracy achieved by our proposed method is attributed to the design of a comprehensive feature set on the predicted secondary structure sequences, which is capable of characterizing the sequence order information, local interactions of the secondary structural elements, and spacial arrangements of alpha helices and beta strands. Thus, it is a valuable method to predict protein structural classes particularly for low-homology amino acid sequences.
从氨基酸序列预测蛋白质结构类别(α、β、α+β 和 α/β)非常重要,因为这有助于研究蛋白质的功能、调节和相互作用。许多方法已经被开发出来用于同源性高的蛋白质序列,其预测准确率可高达 90%。然而,对于平均序列同一性在 20%到 40%之间的低同源性序列,它们的表现相对较差,预测准确率通常低于 60%。
我们提出了一种新的方法,基于从蛋白质预测的二级结构中提取的特征,而不是直接从氨基酸序列预测蛋白质结构类别。它首先使用 PSIPRED 预测每个蛋白质序列的二级结构。然后,使用混沌游戏表示法将预测的二级结构表示为两个时间序列,从中我们使用递归量化分析、基于 K 串的信息熵和基于片段的分析生成一组综合的 24 个特征。生成的特征向量最后被输入到一个简单而强大的 Fisher 判别算法中,用于预测蛋白质结构类别。我们在三个低同源性基准数据集上测试了所提出的方法,分别获得了 82.9%、83.1%和 81.3%的总体预测准确率。与十种现有方法的比较表明,我们的方法在所有测试数据集上的表现都更好,整体准确率提高了 2.3%到 27.5%。实现所提出方法的 Web 服务器可在 http://www1.spms.ntu.edu.sg/~chenxin/RKS_PPSC/ 免费获得。
我们提出的方法之所以能达到如此高的预测准确率,是因为它设计了一个综合的特征集,用于预测二级结构序列,这些特征集能够描述序列顺序信息、二级结构元素的局部相互作用以及α螺旋和β链的空间排列。因此,这是一种预测蛋白质结构类别的有价值的方法,特别是对于低同源性的氨基酸序列。