Institute for Evolution and Biodiversity, Westfalian Wilhelms University Muenster, Germany.
Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Paris, France.
FEBS J. 2018 Jul;285(14):2605-2625. doi: 10.1111/febs.14504. Epub 2018 Jun 29.
Over long time scales, protein evolution is characterized by modular rearrangements of protein domains. Such rearrangements are mainly caused by gene duplication, fusion and terminal losses. To better understand domain emergence mechanisms, we investigated 32 insect genomes covering a speciation gradient ranging from ~2 to ~390 mya. We used established domain models and foldable domains delineated by hydrophobic cluster analysis (HCA), which does not require homologous sequences, to also identify domains which have likely arisen de novo, that is, from previously noncoding DNA. Our results indicate that most novel domains emerge terminally, as they originate from ORF extensions, while fewer arise in middle arrangements, resulting from exonization of intronic or intergenic regions. Many novel domains rapidly migrate between terminal and middle positions and between single- and multidomain arrangements. Young domains, such as most HCA-defined domains, are under strong selection pressure, as they show signals of purifying selection. De novo domains, linked to ancient domains or defined by HCA, have higher degrees of intrinsic disorder and disorder-to-order transition upon binding than ancient domains. However, the corresponding DNA sequences of the novel domains of de novo origins could only rarely be found in sister genomes. We conclude that novel domains are often recruited by other proteins and undergo important structural modifications shortly after their emergence, but evolve too fast to be characterized by cross-species comparisons alone.