Liao Bo, Peng Ting, Chen Haowen, Lin Yaping
College of Information Science and Engineering, Hunan University, Changsha, Hunan, 410082, China.
Protein Pept Lett. 2013 Oct;20(10):1079-87. doi: 10.2174/09298665113209990002.
Knowledge of structural classes is applied in numerous important predictive tasks that address structural and functional features of proteins, although the prediction accuracy of protein structural classes is not high. In this study, 45 different features were rationally designed to model the differences between protein structural classes, 30 of which reflect combined protein sequence information. By means of a correlation function, the protein sequence can be converted to a digital signal sequence, from which we can generate 20 discrete Fourier spectrum numbers. Based on the amino acid segments with different characteristics that occur in protein sequences, the frequencies of 10 kinds of amino acid segments (motifs) in the protein are calculated. The other features reflect secondary structural information: 10 features were proposed to model the strong adjacent correlations in the secondary structural elements and to capture the long-range spatial interactions between secondary structures, and another 5 features were designed to differentiate the α/β class from the α+β class, which is a major problem for existing algorithms. The methods were developed on a large set of low-identity sequences whose secondary structure is predicted from sequence (using PSI-PRED). With this method, the overall prediction accuracy was improved on all four benchmark datasets. In particular, on the FC699, 25PDB and D1189 datasets, accuracy is 1.26%, 1% and 0.85% higher, respectively, than that of the best previous method.
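The conversion of a protein sequence into a digital signal and then into discrete Fourier spectrum numbers can be illustrated with a minimal sketch. The abstract does not state which numeric encoding or correlation function is used, so the sketch below makes its own assumptions: residues are mapped to Kyte-Doolittle hydropathy values, the signal is mean-centered, and the first 20 non-zero-frequency power values are kept. The function names `to_signal` and `dft_spectrum` are hypothetical and are not taken from the paper.

```python
import cmath
import math

# Kyte-Doolittle hydropathy index (an assumed encoding; the paper's
# actual numeric mapping is not given in the abstract).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def to_signal(seq):
    """Map a protein sequence to a list of numbers, skipping unknown residues."""
    return [KD[a] for a in seq.upper() if a in KD]


def dft_spectrum(signal, n_coeffs=20):
    """Return the power at frequencies k = 1..n_coeffs of the centered signal."""
    n = len(signal)
    if n == 0:
        raise ValueError("signal must not be empty")
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the zero-frequency offset
    spectrum = []
    for k in range(1, n_coeffs + 1):
        # Direct evaluation of the discrete Fourier transform at frequency k.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


if __name__ == "__main__":
    features = dft_spectrum(to_signal("AR" * 10))
    print(len(features))
    print(max(range(len(features)), key=features.__getitem__) + 1)
```

For the toy sequence `"AR" * 10`, which alternates with period 2 over 20 residues, the power concentrates at frequency k = 10. Real sequences would give a broader spectrum, and sequences of different lengths would probably need their frequencies normalized before being compared as features.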