Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel.
Proteins. 2011 Jun;79(6):1952-63. doi: 10.1002/prot.23020. Epub 2011 Apr 12.
The identification of catalytic residues is an essential step in the functional characterization of enzymes. We present a purely structural approach to this problem, motivated by the difficulty that evolution-based methods have in annotating structural genomics targets with few or no homologs in the databases. Our approach combines a state-of-the-art support vector machine (SVM) classifier with novel structural features that augment structural clues by spatial averaging and Z-scoring. Special attention is paid to the class imbalance problem that stems from the overwhelming number of non-catalytic residues in enzymes compared to catalytic residues. This problem is tackled by: (1) optimizing the classifier to maximize a performance criterion that considers both Type I and Type II errors in the classification of catalytic and non-catalytic residues; (2) under-sampling non-catalytic residues before SVM training; and (3) during SVM training, penalizing errors in learning catalytic residues more than errors in learning non-catalytic residues. Tested on four enzyme datasets, one specifically designed by us to mimic the structural genomics scenario and three previously evaluated datasets, our structure-based classifier is never inferior to similar structure-based classifiers and is comparable to classifiers that use both structural and evolutionary features. In addition to evaluating the performance of catalytic residue identification, we present detailed case studies on three proteins. This analysis suggests that many false positive predictions may correspond to binding sites and other functional residues. A web server that implements the method, the database we designed, and the source code of the programs are publicly available at http://www.cs.bgu.ac.il/~meshi/functionPrediction.
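The feature augmentation and the three imbalance measures described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 10 Å radius, the 3:1 sampling ratio, the positive-class weight of 5, and the choice of the Matthews correlation coefficient as a criterion sensitive to both error types are all assumptions made for illustration, since the abstract specifies none of them. The SVM itself is not trained here; only the class-weighted hinge loss that such training would minimize is shown.

```python
import math
import random
from statistics import mean, pstdev


def z_scores(values):
    """Z-score one per-residue feature within a single protein."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def spatial_average(values, coords, radius=10.0):
    """Average a feature over all residues within `radius` of each residue.

    `radius` (in the units of `coords`) is an assumed value; each residue
    is included in its own neighbourhood.
    """
    averaged = []
    for ci in coords:
        near = [v for v, cj in zip(values, coords) if math.dist(ci, cj) <= radius]
        averaged.append(mean(near))
    return averaged


def undersample(labels, ratio=3, seed=0):
    """Keep every catalytic residue (label 1) and at most `ratio` times as
    many randomly chosen non-catalytic residues. Returns sorted indices."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y != 1]
    kept_neg = rng.sample(neg, min(len(neg), ratio * len(pos)))
    return sorted(pos + kept_neg)


def weighted_hinge_loss(scores, labels, pos_weight=5.0):
    """Hinge loss in which margin violations on catalytic residues cost
    `pos_weight` times more than violations on non-catalytic residues."""
    total = 0.0
    for s, y in zip(scores, labels):
        sign = 1.0 if y == 1 else -1.0
        weight = pos_weight if y == 1 else 1.0
        total += weight * max(0.0, 1.0 - sign * s)
    return total


def mcc(predicted, labels):
    """Matthews correlation coefficient: one criterion that accounts for
    both false positives (Type I) and false negatives (Type II)."""
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(predicted, labels) if p != 1 and y != 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y != 1)
    fn = sum(1 for p, y in zip(predicted, labels) if p != 1 and y == 1)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

In a setup like this, the sampling ratio and the positive-class weight would be treated as hyperparameters and chosen to maximize the balanced criterion on held-out proteins rather than plain accuracy, which a classifier can inflate by labelling every residue non-catalytic.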