Hegyi H, Gerstein M
Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA.
Genome Res. 2001 Oct;11(10):1632-40. doi: 10.1101/gr.183801.
Annotation transfer is a principal process in genome annotation. It involves "transferring" structural and functional annotation to uncharacterized open reading frames (ORFs) in a newly completed genome from experimentally characterized proteins similar in sequence. To prevent errors in genome annotation, it is important that this process be robust and statistically well-characterized, especially with regard to how it depends on the degree of sequence similarity. Previously, we and others have analyzed annotation transfer in single-domain proteins. Multi-domain proteins, which make up the bulk of the ORFs in eukaryotic genomes, present more complex issues in functional conservation. Here we present a large-scale survey of annotation transfer in these proteins, using SCOP superfamilies to define domain folds and a thesaurus based on SWISS-PROT keywords to define functional categories. Our survey reveals that multi-domain proteins have significantly less functional conservation than single-domain ones, except when they share exactly the same combination of domain folds. In particular, we find that for pairs of multi-domain proteins sharing one structural superfamily, approximate function can be accurately transferred with only 35% certainty. In contrast, this value is 67% for pairs of single-domain proteins sharing the same structural superfamily. On the other hand, if two multi-domain proteins contain the same combination of two structural superfamilies, the probability of their sharing the same function increases to 80%; in the case of complete coverage along the full length of both proteins, this value increases further to >90%. Moreover, we found that only 70 of the current total of 455 structural superfamilies are found in both single- and multi-domain proteins, and only 14 of these were associated with the same function in both categories of proteins.
We also investigated the degree to which function could be transferred between pairs of multi-domain proteins with respect to the degree of sequence similarity between them, finding that functional divergence at a given amount of sequence similarity is always about two-fold greater for pairs of multi-domain proteins (sharing similarity over a single domain) in comparison to pairs of single-domain ones, though the overall shape of the relationship is quite similar. Further information is available at http://partslist.org/func or http://bioinfo.mbb.yale.edu/partslist/func.
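The conservation percentages quoted above are fractions of protein pairs that fall in the same functional category. As a rough illustration only, the following sketch shows how such a pairwise fraction could be computed for the three cases the abstract compares. It is not the authors' pipeline: the records, the superfamily labels (`sfA`, `sfB`, `sfC`), the function labels, and all helper names are hypothetical toy values, and "sharing one structural superfamily" is interpreted here, as an assumption, as having exactly one superfamily in common.

```python
from itertools import combinations

# Hypothetical toy records (NOT data from the paper):
# (protein id, set of superfamily ids, functional category)
PROTEINS = [
    ("P1", frozenset({"sfA"}), "kinase"),
    ("P2", frozenset({"sfA"}), "kinase"),
    ("P3", frozenset({"sfA"}), "transport"),
    ("P4", frozenset({"sfA", "sfB"}), "kinase"),
    ("P5", frozenset({"sfA", "sfB"}), "kinase"),
    ("P6", frozenset({"sfA", "sfC"}), "transport"),
]


def conservation(pairs):
    """Fraction of pairs whose functional categories match (None if no pairs)."""
    pairs = list(pairs)
    if not pairs:
        return None
    same = sum(1 for a, b in pairs if a[2] == b[2])
    return same / len(pairs)


def single_domain_pairs(records):
    """Pairs of single-domain proteins with the same superfamily."""
    singles = [r for r in records if len(r[1]) == 1]
    return [(a, b) for a, b in combinations(singles, 2) if a[1] == b[1]]


def multi_pairs_sharing_one(records):
    """Pairs of multi-domain proteins with exactly one superfamily in common
    (an assumed reading of 'sharing one structural superfamily')."""
    multis = [r for r in records if len(r[1]) > 1]
    return [(a, b) for a, b in combinations(multis, 2) if len(a[1] & b[1]) == 1]


def multi_pairs_same_combination(records):
    """Pairs of multi-domain proteins with an identical superfamily combination."""
    multis = [r for r in records if len(r[1]) > 1]
    return [(a, b) for a, b in combinations(multis, 2) if a[1] == b[1]]


if __name__ == "__main__":
    print("single-domain:", conservation(single_domain_pairs(PROTEINS)))
    print("multi, one shared:", conservation(multi_pairs_sharing_one(PROTEINS)))
    print("multi, same combination:", conservation(multi_pairs_same_combination(PROTEINS)))
```

With real data, each record would come from superfamily assignments and keyword-derived functional categories; the toy values here only exercise the counting logic and say nothing about the reported 35%, 67%, and 80% figures.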