Shakhnovich Boris E, Max Harvey J
Bioinformatics Program, Boston University, Boston, MA 02215, USA.
J Mol Biol. 2004 Apr 2;337(4):933-49. doi: 10.1016/j.jmb.2004.02.009.
Since the advent of structural genomics, research has focused on correctly identifying domain boundaries, as well as domain similarities and differences in the context of their evolutionary relationships. As structural genomics ramps up and adds more and more information to the databanks, questions about the accuracy and completeness of our classification and annotation systems have come to the forefront of this research. A central question is how structural similarity relates to functional similarity. Here, we begin to answer these questions rigorously and quantitatively by first exploring the consensus among the most common protein domain structure annotation databases: CATH, SCOP and FSSP. Each of these databases explores the evolutionary relationships between protein domains using a combination of automatic and manual, structural and functional, continuous and discrete similarity measures. To examine the issue of consensus thoroughly, we build a generalized graph from each of these databases and hierarchically cluster these graphs at interval thresholds. We then employ a distance measure to find regions of greatest overlap. Using this procedure, we were able not only to quantify the level of consensus between the different annotation systems, but also to define the graph-theoretical origins of the annotation schema of class, family and superfamily: the same thresholds that define the best consensus regions between FSSP, SCOP and CATH correspond to distinct, non-random phase transitions in the structure comparison graph itself. To investigate further the correspondence in divergence between structure and function, we introduce a measure of functional entropy that calculates divergence in function space. First, we use this measure to calculate the general correlation between structural homology and functional proximity. We then extend this analysis by quantitatively calculating the average amount of functional information gained from our understanding of structural distance, and the corollary inherent uncertainty that represents the theoretical limit of our ability to infer function from structural similarity. Finally, we show how our measure of functional "entropy" translates into a more intuitive concept of functional annotation into similar EC classes.
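The two ideas in the abstract, clustering a structure comparison graph at a threshold and measuring functional divergence within the resulting clusters, can be illustrated with a minimal sketch. The abstract does not give the paper's definitions, so everything here is an assumption for illustration: clusters are taken as connected components of the thresholded graph (equivalent to single-linkage clustering), functional entropy is taken as the Shannon entropy of function labels within a cluster, and the names `clusters_at_threshold`, `functional_entropy` and `mean_entropy` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def clusters_at_threshold(nodes, scored_edges, threshold):
    """Return connected components of the graph that keeps only
    edges whose similarity score is >= threshold (assumed clustering rule)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_edges:
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())


def functional_entropy(labels):
    """Shannon entropy, in bits, of the function labels in one cluster
    (an assumed stand-in for the paper's functional entropy)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def mean_entropy(clusters, function_of):
    """Size-weighted mean functional entropy over all clusters."""
    total = sum(len(c) for c in clusters)
    weighted = sum(
        len(c) * functional_entropy([function_of[d] for d in c])
        for c in clusters
    )
    return weighted / total


# Toy data: four hypothetical domains with made-up similarity scores
# and made-up EC-style function labels.
domains = ["d1", "d2", "d3", "d4"]
edges = [("d1", "d2", 9.0), ("d2", "d3", 4.0), ("d3", "d4", 8.0)]
function_of = {"d1": "EC 1.1", "d2": "EC 1.1", "d3": "EC 2.7", "d4": "EC 2.7"}

strict = clusters_at_threshold(domains, edges, threshold=6.0)
loose = clusters_at_threshold(domains, edges, threshold=3.0)
```

In this toy example the strict threshold yields two functionally pure clusters, while the loose threshold merges them into one cluster with mixed function. Under these assumptions, this mirrors the abstract's point that structural similarity leaves a residual uncertainty about function.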