Hong Huixiao, Hong Qilong, Perkins Roger, Shi Leming, Fang Hong, Su Zhenqiang, Dragan Yvonne, Fuscoe James C, Tong Weida
Division of Systems Toxicology, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas 72079, USA.
J Comput Biol. 2009 Dec;16(12):1671-88. doi: 10.1089/cmb.2008.0115.
Rapid advances in proteomic analysis, coupled with the completion of multiple genome sequences, have increased the demand for determining protein function. A first step is to classify proteins into families or to predict the family to which a protein belongs. A method was developed that predicts protein family from protein sequence alone using support vector machine (SVM) models. In these models, amino acids were grouped into three categories (apolar, polar, and charged). Fragments of one to five consecutive residues were annotated by amino acid category to define the features of each protein. SVM models were constructed from the features of a training set of proteins and then evaluated on an independent set of proteins. The approach was tested on 20 protein families from the iProClass database of the Protein Information Resource (PIR). Two-class SVM models achieved an average prediction accuracy of 0.9985, and multi-class SVM models achieved an accuracy of 0.9941. This study demonstrates that SVM-based methods can accurately recognize and predict the protein family to which a sequence belongs based solely on its primary amino acid sequence.
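The feature construction described above can be illustrated with a minimal sketch. Several details are assumptions, since the abstract does not give them: which residues belong to each of the three categories (the grouping below is a common textbook-style split, not necessarily the authors'), the use of normalized fragment frequencies as feature values, and the handling of non-standard residue symbols. The names `CLASS_OF`, `FEATURES`, and `fragment_features` are hypothetical.

```python
from itertools import product

# ASSUMPTION: illustrative grouping; the abstract does not list class membership.
APOLAR = "AVLIMFWPGC"
POLAR = "STNQYH"
CHARGED = "DEKR"

# Lowercase class letters (a, p, c) avoid confusion with one-letter residue codes.
CLASS_OF = {aa: "a" for aa in APOLAR}
CLASS_OF.update({aa: "p" for aa in POLAR})
CLASS_OF.update({aa: "c" for aa in CHARGED})

MAX_LEN = 5  # fragments of one to five consecutive residues

# Fixed feature order: every string over {a, p, c} of length 1..5
# (3 + 9 + 27 + 81 + 243 = 363 features).
FEATURES = [
    "".join(combo)
    for k in range(1, MAX_LEN + 1)
    for combo in product("apc", repeat=k)
]


def fragment_features(seq):
    """Return a 363-element vector of fragment frequencies for one sequence."""
    # Unrecognized symbols become "x"; fragments containing "x" are not counted.
    reduced = "".join(CLASS_OF.get(aa, "x") for aa in seq.upper())
    counts = dict.fromkeys(FEATURES, 0)
    for k in range(1, MAX_LEN + 1):
        for i in range(len(reduced) - k + 1):
            frag = reduced[i:i + k]
            if frag in counts:
                counts[frag] += 1
    # ASSUMPTION: divide by the number of windows of that length so that
    # proteins of different lengths are comparable.
    vector = []
    for feat in FEATURES:
        windows = max(len(reduced) - len(feat) + 1, 1)
        vector.append(counts[feat] / windows)
    return vector
```

Vectors produced this way for a training set could then be passed to any SVM implementation (for example, scikit-learn's `sklearn.svm.SVC`), with a separate held-out set used for evaluation as the abstract describes.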