Kummerfeld Sarah K, Teichmann Sarah A
Department of Developmental Biology, 279 Campus Dr, Stanford, 94305, CA, USA.
BMC Bioinformatics. 2009 Jan 29;10:39. doi: 10.1186/1471-2105-10-39.
Domains are the building blocks of proteins. During evolution, they have been duplicated, fused and recombined to produce proteins with novel structures and functions. Structural and genome-scale studies have shown that pairs or groups of domains observed together in a protein are almost always found in only one N- to C-terminal order and are the result of a single recombination event that has been propagated by duplication of the multi-domain unit. Previous studies of domain organisation have used graph theory to represent the co-occurrence of domains within proteins. We build on this approach by adding directionality to the graphs and connecting nodes based on their relative order in the protein. Most of the time, the linear order of domains is conserved. However, using the directed graph representation, we have identified non-linear features of domain organisation that are over-represented in genomes. Recognising these patterns and unravelling how they have arisen may allow us to understand the functional relationships between domains and how the protein repertoire has evolved.
We identify groups of domains that are not linearly conserved, but instead have been shuffled during evolution so that they occur in multiple different orders. We consider 192 genomes across all three kingdoms of life and use domain and protein annotation to understand the functional significance of these groups. To identify these features and assess their statistical significance, we represent the linear order of domains in proteins as a directed graph and apply graph theoretical methods. We describe two higher-order patterns of domain organisation, clusters and bi-directionally associated domain pairs, and explore their functional importance and phylogenetic conservation.
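The directed graph representation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that an edge A -> B is drawn whenever domain A immediately precedes domain B (N to C) in at least one protein, and the protein names and architectures in `proteins` are hypothetical toy data. The functions `build_domain_graph` and `bidirectional_pairs` are illustrative names only.

```python
from collections import defaultdict


def build_domain_graph(proteins):
    """Map each directed edge (A, B) to the set of proteins in which
    domain A immediately precedes domain B in N-to-C order.

    Assumption: only adjacent domains are connected, and tandem repeats
    of the same domain are ignored.
    """
    edges = defaultdict(set)
    for name, domains in proteins.items():
        for a, b in zip(domains, domains[1:]):
            if a != b:
                edges[(a, b)].add(name)
    return edges


def bidirectional_pairs(edges):
    """Return domain pairs that occur in both orders (A -> B and B -> A)."""
    return sorted(
        {tuple(sorted(pair)) for pair in edges if (pair[1], pair[0]) in edges}
    )


# Hypothetical toy architectures, listed from N terminus to C terminus.
proteins = {
    "P1": ["SH2", "Kinase"],
    "P2": ["Kinase", "SH2"],
    "P3": ["SH3", "SH2", "Kinase"],
    "P4": ["PH", "SH2"],
}

graph = build_domain_graph(proteins)
pairs = bidirectional_pairs(graph)
```

In this toy input, only the SH2 and Kinase domains appear in both orders, so they form the single bi-directionally associated pair. Keeping the supporting proteins on each edge makes it easy to inspect which proteins give rise to a given pattern.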
Taking into account the order of domains, we have derived a novel picture of global protein organisation. We found that all genomes have a higher than expected degree of clustering, and more domain pairs occurring in both forward and reverse orientation in different proteins, than random graphs with identical degree distributions. Although these features are statistically over-represented, they are still fairly rare. Looking in detail at the proteins involved, we found strong functional relationships within each cluster. In addition, the domains tended to be involved in protein-protein interactions and to be able to function as independent structural units. A particularly striking example was the human Jak-STAT signalling pathway, which makes use of a set of domains in a range of orders and orientations to provide nuanced signalling functionality. This illustrates the importance of functional and structural constraints (or lack thereof) on domain organisation.
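The comparison against random graphs with identical degree distributions can be sketched as follows. This is one common way to build such a null model (degree-preserving edge swaps) and is an assumption here, not a description of the procedure actually used in the study. The function names, the toy edge set and the number of swaps and replicates are all hypothetical.

```python
import random


def rewire(edges, n_swaps, seed=0):
    """Randomise a directed graph while preserving every node's
    in-degree and out-degree.

    Each swap replaces edges (a, b) and (c, d) with (a, d) and (c, b),
    and is skipped if it would create a self-loop or a duplicate edge.
    """
    rng = random.Random(seed)
    edge_list = sorted(edges)  # sorted for reproducibility
    edge_set = set(edge_list)
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        new1, new2 = (a, d), (c, b)
        if a == d or c == b or new1 in edge_set or new2 in edge_set:
            continue
        edge_set -= {(a, b), (c, d)}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_set


def count_bidirectional(edge_set):
    """Count unordered node pairs connected in both directions."""
    return sum(1 for (a, b) in edge_set if a < b and (b, a) in edge_set)


def empirical_p_value(edges, n_random=200, n_swaps=50):
    """Fraction of randomised graphs with at least as many
    bi-directional pairs as the observed graph."""
    observed = count_bidirectional(edges)
    hits = sum(
        count_bidirectional(rewire(edges, n_swaps, seed=s)) >= observed
        for s in range(n_random)
    )
    return hits / n_random


# Hypothetical toy graph: one bi-directional pair (A <-> B).
toy_edges = {("A", "B"), ("B", "A"), ("C", "D"), ("D", "E"), ("E", "C"), ("A", "C")}
p = empirical_p_value(toy_edges)
```

A small empirical p-value would indicate that bi-directional pairs are more frequent in the observed graph than in degree-matched random graphs. The same scheme could be applied to a clustering statistic by replacing `count_bidirectional` with the measure of interest.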