Departamento de Física Geral, Instituto de Física, Universidade de São Paulo, SP, Brazil.
Comput Biol Chem. 2012 Oct;40:15-9. doi: 10.1016/j.compbiolchem.2012.06.003. Epub 2012 Aug 14.
Protein databases contain a substantial number of proteins whose structures have been determined but whose functions remain unannotated. Understanding the relationship between function and structure can help predict function on a large scale. We analyzed the similarities in global physicochemical parameters for a set of enzymes classified according to the four hierarchical levels of the Enzyme Commission (EC) system. Using relevance theory, we introduced a distance between proteins in the space of physicochemical characteristics. This was done by minimizing a cost function of the metric tensor, built to reflect the EC classification system. Applying an unsupervised clustering method to a set of 1025 enzymes, we obtained no relevant cluster formation compatible with the EC classification. The distributions of distances between enzymes from the same EC group and from different EC groups were compared using histograms. The same analysis was also performed using sequence alignment similarity as a distance. Our results suggest that global structure parameters are not sufficient to segregate enzymes according to the EC hierarchy. This indicates that the features essential for function are local rather than global. Consequently, methods that predict function from global attributes should not be expected to achieve high accuracy in predicting the main EC classes unless they rely on similarities between enzymes in the training and validation datasets. Furthermore, these results are consistent with a substantial number of studies suggesting that function evolves fundamentally by recruitment, i.e., the same protein motif or fold can be used to perform different enzymatic functions, and only a few specific amino acids (AAs) are actually responsible for enzyme activity. These essential amino acids should belong to active sites, and an effective method for predicting function should be able to recognize them.
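The distance comparison described in the abstract can be illustrated with a minimal sketch. It assumes a distance of the form d(x, y) = sqrt((x - y)^T M (x - y)) for a symmetric positive semi-definite metric tensor M. The abstract does not give the cost function used to fit M, so M is simply taken as given here (the identity in the demo). The function names, the synthetic feature matrix, and the histogram-overlap summary are all hypothetical choices for illustration, not the authors' procedure.

```python
import numpy as np

def metric_distances(X, M):
    """Pairwise distances d_ij = sqrt((x_i - x_j)^T M (x_i - x_j)).

    X: (n, p) array of global physicochemical features (assumed layout).
    M: (p, p) symmetric positive semi-definite metric tensor (assumed given).
    """
    diff = X[:, None, :] - X[None, :, :]             # shape (n, n, p)
    sq = np.einsum("ijk,kl,ijl->ij", diff, M, diff)  # quadratic form per pair
    return np.sqrt(np.maximum(sq, 0.0))              # clip rounding noise

def split_by_label(D, labels):
    """Split the upper-triangle distances into same-group and different-group."""
    labels = np.asarray(labels)
    i, j = np.triu_indices(len(labels), k=1)
    same = labels[i] == labels[j]
    d = D[i, j]
    return d[same], d[~same]

def histogram_overlap(a, b, bins=20):
    """Overlap of two normalized histograms on shared bins (1.0 = identical)."""
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    edges = np.linspace(lo, hi, bins + 1)
    pa, _ = np.histogram(a, bins=edges)
    pb, _ = np.histogram(b, bins=edges)
    return float(np.minimum(pa / pa.sum(), pb / pb.sum()).sum())

if __name__ == "__main__":
    # Synthetic stand-in data: random features with random top-level EC labels,
    # so no class structure is expected by construction.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 5))
    labels = rng.integers(1, 7, size=60)
    D = metric_distances(X, np.eye(5))
    intra, inter = split_by_label(D, labels)
    print("histogram overlap:", histogram_overlap(intra, inter))
```

An overlap close to 1 means the same-group and different-group distance histograms are nearly indistinguishable, which is the kind of outcome the abstract reports for global parameters. A learned M would replace `np.eye(5)` in the demo.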