Dobson Paul D, Doig Andrew J
Department of Biomolecular Sciences, UMIST, P.O. Box 88, Manchester M60 1QD, UK.
J Mol Biol. 2003 Jul 18;330(4):771-83. doi: 10.1016/s0022-2836(03)00628-4.
The ability to predict protein function from structure is becoming increasingly important as the number of structures resolved is growing more rapidly than our capacity to study function. Current methods for predicting protein function are mostly reliant on identifying a similar protein of known function. For proteins that are highly dissimilar or are only similar to proteins also lacking functional annotations, these methods fail. Here, we show that protein function can be predicted as enzymatic or not without resorting to alignments. We describe 1178 high-resolution proteins in a structurally non-redundant subset of the Protein Data Bank using simple features such as secondary-structure content, amino acid propensities, surface properties and ligands. The subset is split into two functional groupings, enzymes and non-enzymes. We use the support vector machine-learning algorithm to develop models that are capable of assigning the protein class. Validation of the method shows that the function can be predicted to an accuracy of 77% using 52 features to describe each protein. An adaptive search of possible subsets of features produces a simplified model based on 36 features that predicts at an accuracy of 80%. We compare the method to sequence-based methods that also avoid calculating alignments and predict a recently released set of unrelated proteins. The most useful features for distinguishing enzymes from non-enzymes are secondary-structure content, amino acid frequencies, number of disulphide bonds and size of the largest cleft. This method is applicable to any structure as it does not require the identification of sequence or structural similarity to a protein of known function.
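The classification step described above can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the kernel, parameters or software used, so a plain linear support vector machine trained by subgradient descent on the hinge loss is swapped in here. The two features are synthetic stand-ins (imagine standardized values for a secondary-structure fraction and the size of the largest cleft), and every function name and parameter value below is an assumption made for illustration.

```python
import random


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=50, seed=0):
    """Train a linear SVM by stochastic subgradient descent on the hinge loss.

    X: list of feature vectors; y: labels in {+1, -1}
    (+1 = enzyme, -1 = non-enzyme in this illustration).
    lam, lr and epochs are arbitrary illustrative values, not from the paper.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (dot(w, X[i]) + b)
            # L2 regularization shrinks the weights on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            # Points inside the margin (or misclassified) pull the boundary.
            if margin < 1.0:
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 according to the side of the hyperplane x falls on."""
    return 1 if dot(w, x) + b >= 0 else -1


def make_synthetic(n_per_class, seed):
    """Generate well-separated synthetic data; NOT real protein features."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (1, -1):
        for _ in range(n_per_class):
            X.append([rng.gauss(label, 0.3), rng.gauss(label, 0.3)])
            y.append(label)
    return X, y


X_train, y_train = make_synthetic(100, seed=1)
X_test, y_test = make_synthetic(50, seed=2)

w, b = train_linear_svm(X_train, y_train)
correct = sum(predict(w, b, x) == t for x, t in zip(X_test, y_test))
accuracy = correct / len(y_test)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

The synthetic classes are deliberately easy to separate, so the accuracy printed here says nothing about the 77% and 80% figures reported above. A realistic reproduction would need the actual 52 or 36 structural features per protein and a validation scheme matching the paper's.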