Chen Yu-Ching, Lin Yeong-Shin, Lin Chih-Jen, Hwang Jenn-Kang
Institute of Bioinformatics, National Chiao Tung University, HsinChu, Taiwan, ROC.
Proteins. 2004 Jun 1;55(4):1036-42. doi: 10.1002/prot.20079.
We use the support vector machine (SVM) method to predict the bonding states of cysteines. In addition to local descriptors such as local sequences, we include global information about each protein: its amino acid composition and its cysteine state sequence, that is, the pattern of bonded and nonbonded cysteines. SVMs based on either local sequences or global amino acid compositions yielded similar prediction accuracies on a data set of 4136 cysteine-containing segments extracted from 969 nonhomologous proteins. However, an SVM based on multiple feature vectors (combining local sequences and global amino acid compositions) significantly improved the prediction accuracy, from 80% to 86%. When coupled with cysteine state sequences, the SVM based on multiple feature vectors yielded 90% overall prediction accuracy and a Matthews correlation coefficient of 0.77, around 10% and 22% higher, respectively, than the corresponding values obtained by the SVM based on local sequence information.
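As a rough illustration of what combining local and global descriptors could look like, the following Python sketch concatenates a one-hot encoding of the residues around a cysteine with the amino acid composition of the whole protein, and computes the Matthews correlation coefficient from confusion-matrix counts. The abstract does not give the window size, the encoding, or the SVM settings, so the function names, the `half_width` parameter, and the one-hot scheme here are assumptions for illustration only, not the authors' implementation.

```python
from math import sqrt

# The 20 standard amino acids, in a fixed order used for all encodings.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def local_window_features(seq, pos, half_width=3):
    """One-hot encode the residues in a window centred on seq[pos].

    half_width is a hypothetical parameter; the abstract does not state
    the window size. Positions beyond the sequence ends, and residues
    outside the 20 standard amino acids, are encoded as all zeros.
    """
    features = []
    for i in range(pos - half_width, pos + half_width + 1):
        one_hot = [0.0] * len(AMINO_ACIDS)
        if 0 <= i < len(seq):
            idx = AMINO_ACIDS.find(seq[i])
            if idx >= 0:
                one_hot[idx] = 1.0
        features.extend(one_hot)
    return features


def composition_features(seq):
    """Global descriptor: fraction of each standard amino acid in seq."""
    n = len(seq)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(a) / n for a in AMINO_ACIDS]


def combined_features(seq, pos, half_width=3):
    """Concatenate the local window and the global composition vectors."""
    return local_window_features(seq, pos, half_width) + composition_features(seq)


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column is empty
    return (tp * tn - fp * fn) / denom
```

Each cysteine in a protein would get one such combined vector, which could then be passed to any SVM implementation as a training or test example. The cysteine state sequence described in the abstract is not encoded in this sketch.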