College of Computer Science, Chongqing University, Chongqing 400044, China.
J Comput Chem. 2009 Nov 15;30(14):2277-84. doi: 10.1002/jcc.21229.
Based on the features of protein sequential patterns, we used increment of diversity combined with quadratic discriminant analysis (IDQD) to predict beta-hairpin motifs in protein sequences. Three rules are used to extract fixed-length sequential patterns from the raw beta-beta motifs. Basic amino acid composition, dipeptide composition, and amino acid composition distribution are combined to represent the compositional features. Eighteen feature variables are defined in terms of the increment of diversity (ID) for each sequential pattern to be predicted, and they are integrated in a single formal framework given by IDQD. The method is trained and tested on the ArchDB40 dataset, which contains 3088 proteins. On the independent testing dataset, the overall prediction accuracy and Matthews correlation coefficient are 81.7% and 0.60, respectively. On a dataset of 2088 proteins previously used by Kumar et al. (Nucleic Acids Res 2005, 33, 154), the independent testing dataset yields a higher accuracy of 84.5% and a Matthews correlation coefficient of 0.68. For a fair assessment of our method, the performance is also evaluated on all 63 proteins used in CASP6, where the overall prediction accuracy is 74.2% for the independent testing dataset.
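The abstract does not spell out how an ID feature or the reported metrics are computed. The sketch below is illustrative only and is not the authors' implementation. It assumes the definition of increment of diversity commonly given in the literature, D(X) = N ln N - Σ n_i ln n_i and ID(X, Y) = D(X + Y) - D(X) - D(Y), together with the standard formulas for accuracy and the Matthews correlation coefficient. All function names and the example sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Count vector of the 20 amino acids in a sequence (hypothetical helper)."""
    counts = Counter(seq)
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]


def diversity(counts):
    """Diversity measure D(X) = N ln N - sum(n_i ln n_i); zero counts contribute 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y); small when the two sources are similar."""
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)


def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions among all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Made-up sequences: a query pattern versus a pooled "class" composition.
    query = composition("VKVTVNGEKYEV")
    reference = composition("TVKVEYNGKTYEVRVT")
    print(increment_of_diversity(query, reference))
```

In an IDQD-style pipeline, ID values of this kind (one per feature type and class) would presumably serve as the inputs to the quadratic discriminant step; the exact construction of the eighteen variables is not given in the abstract.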