Institute for Evolution and Biodiversity, 48149, Münster, Germany.
J Mol Evol. 2020 May;88(4):382-398. doi: 10.1007/s00239-020-09939-z. Epub 2020 Apr 7.
Orphan genes, lacking detectable homologs in outgroup species, typically represent 10-30% of eukaryotic genomes. Efforts to find the source of these young genes indicate that de novo emergence from non-coding DNA may in part explain their prevalence. Here, we investigate the roots of orphan gene emergence in the Drosophila genus. Across the annotated proteomes of twelve species, we find 6297 orphan genes within 4953 taxon-specific clusters of orthologs. By inferring the ancestral DNA as non-coding for between 550 and 2467 (8.7-39.2%) of these genes, we describe for the first time how de novo emergence contributes to the abundance of clade-specific Drosophila genes. In support of them having functional roles, we show that de novo genes have robust expression and translational support. However, the distinct nucleotide sequences of de novo genes, which have characteristics intermediate between intergenic regions and conserved genes, reflect their recent birth from non-coding DNA. We find that de novo genes encode more disordered proteins than both older genes and intergenic regions. Together, our results suggest that gene emergence from non-coding DNA provides an abundant source of material for the evolution of new proteins. Following gene birth, gradual evolution over large evolutionary timescales moulds sequence properties towards those of conserved genes, resulting in a continuum of properties whose starting points depend on the nucleotide sequences of an initial pool of novel genes.