Cai C Z, Han L Y, Ji Z L, Chen Y Z
Department of Applied Physics, Chongqing University, Chongqing, People's Republic of China.
Proteins. 2004 Apr 1;55(1):66-76. doi: 10.1002/prot.20045.
One approach to facilitating protein function prediction is to classify proteins into functional families. Recent studies on the classification of G-protein coupled receptors and other proteins suggest that a statistical learning method, support vector machines (SVM), may be useful for classifying proteins into functional families. In this work, SVM is applied to, and tested on, the classification of enzymes into functional families defined by the Enzyme Nomenclature Committee of IUBMB. The SVM classification system for each family is trained on representative enzymes of that family and on seed proteins of Pfam curated protein families. The classification accuracy ranges from 50.0% to 95.7% for enzymes from 46 families and from 79.0% to 100% for non-enzymes. The corresponding Matthews correlation coefficient ranges from 54.1% to 96.1%. Moreover, 80.3% of the 8,291 correctly classified enzymes are uniquely classified into a specific enzyme family by using a scoring function, indicating that SVM may have a certain level of unique prediction capability. Testing results also suggest that SVM is, in some cases, capable of classifying distantly related enzymes and homologous enzymes of different functions. Efforts are under way to use a more comprehensive set of enzymes as training sets and to incorporate multi-class SVM classification systems to further enhance the unique prediction accuracy. Our results suggest the potential of SVM for enzyme family classification and for facilitating protein function prediction. Our software is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.
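The abstract reports per-family accuracies for enzymes and non-enzymes together with a Matthews correlation coefficient, but does not spell out the formulas. The sketch below assumes the standard definitions: accuracy on family members as TP/(TP+FN), accuracy on non-members as TN/(TN+FP), and the usual MCC expression. The function name `family_metrics` and the counts in the usage line are hypothetical and are not taken from the paper.

```python
import math


def family_metrics(tp, fn, tn, fp):
    """Return (positive accuracy, negative accuracy, MCC) for one binary classifier.

    Assumed standard definitions (not quoted from the paper):
      positive accuracy = TP / (TP + FN)
      negative accuracy = TN / (TN + FP)
      MCC = (TP*TN - FP*FN) / sqrt((TP+FN)(TP+FP)(TN+FP)(TN+FN))
    """
    pos_acc = tp / (tp + fn) if (tp + fn) else 0.0
    neg_acc = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    # By convention, report MCC as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return pos_acc, neg_acc, mcc


# Hypothetical counts for a single family, for illustration only.
pos_acc, neg_acc, mcc = family_metrics(tp=90, fn=10, tn=950, fp=50)
print(f"positive accuracy = {pos_acc:.1%}")
print(f"negative accuracy = {neg_acc:.1%}")
print(f"MCC = {mcc:.3f}")
```

Because MCC uses all four cells of the confusion matrix, it is less easily inflated than accuracy alone when family members are far outnumbered by non-members, which is presumably why it is reported alongside the two accuracy ranges.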