Using random forest algorithm to predict β-hairpin motifs.

Author information

Jia Shao-Chun, Hu Xiu-Zhen

Affiliation

College of Sciences, Inner Mongolia University of Technology, Hohhot 010051, P.R. China.

Publication information

Protein Pept Lett. 2011 Jun;18(6):609-17. doi: 10.2174/092986611795222777.

DOI:10.2174/092986611795222777

PMID:21309739

Abstract

A novel method is presented for predicting β-hairpin motifs in protein sequences. The method applies the Random Forest algorithm to multiple characteristic parameters: position-specific amino acid composition, position-specific hydropathy composition, predicted secondary structure information, and auto-correlation function values. First, the method is trained and tested on a set of 8,291 β-hairpin motifs and 6,865 non-β-hairpin motifs. The overall accuracy and Matthews correlation coefficient reach 82.2% and 0.64 under 5-fold cross-validation, and 81.7% and 0.63 on the independent test. Second, the method is tested on a set of 4,884 β-hairpin motifs and 4,310 non-β-hairpin motifs used in previous studies. The overall accuracy and Matthews correlation coefficient reach 80.9% and 0.61 under 5-fold cross-validation, and 80.6% and 0.60 on the independent test, which is better than the results reported in those previous studies. Third, with the 4,884 β-hairpin motifs and 4,310 non-β-hairpin motifs as the training set and the 8,291 β-hairpin motifs and 6,865 non-β-hairpin motifs as the independent test set, the overall accuracy and Matthews correlation coefficient reach 81.5% and 0.63 on the independent test.
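The two evaluation measures quoted in the abstract can be illustrated with a minimal Python sketch. It assumes the standard definitions of overall accuracy and the Matthews correlation coefficient for a binary classifier; the abstract does not spell out the formulas the authors used, and the function names and the labeling convention (1 for β-hairpin, 0 for non-β-hairpin) are hypothetical choices made only for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (assumed: 1 = hairpin, 0 = non-hairpin)."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def overall_accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def matthews_cc(tp, tn, fp, fn):
    """Standard Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy labels for illustration only; these are not data from the paper.
    y_true = [1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 1, 0]
    counts = confusion_counts(y_true, y_pred)
    print("accuracy:", overall_accuracy(*counts))
    print("MCC:", matthews_cc(*counts))
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, which is why it is commonly reported alongside accuracy when the two classes differ in size, as they do in both datasets described here.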
