Laboratoire de Biochimie (CNRS UMR7654), Ecole Polytechnique, Palaiseau, France.
Sci Rep. 2017 Nov 20;7(1):15873. doi: 10.1038/s41598-017-16221-8.
Gene pairs that overlap in their coding regions are rare except in viruses. They may occur transiently in gene creation and are of biotechnological interest. We have examined the possibility of encoding an arbitrary pair of protein domains as a dual gene, with the shorter coding sequence completely embedded in the longer one. For 500 × 500 domain pairs (X, Y), we computationally designed homologous pairs (X', Y') coded this way, using an algorithm that provably maximizes the sequence similarity between (X', Y') and (X, Y). Three schemes were considered, with X' and Y' coded on the same or complementary strands. For 16% of the pairs, an overlapping coding exists where the level of homology of X', Y' to the natural proteins represents an E-value of 10^-10 or better. Thus, for an arbitrary domain pair, it is surprisingly easy to design homologous sequences that can be encoded as a fully overlapping gene pair. The algorithm is general and was used to design 200 triple genes, with three proteins encoded by the same DNA segment. The ease of design suggests that overlapping genes may have occurred frequently in evolution and could readily be used to compress or constrain artificial genomes.
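To make the idea of a fully overlapping coding concrete, the following minimal Python sketch reads a single DNA segment in more than one way: in its primary frame, in a shifted frame on the same strand, and on the complementary strand. It is only an illustration of how one segment can specify several protein sequences, not the design algorithm described in the abstract. The function names, the example sequence, and the choice of frame offsets are assumptions made for illustration, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, offset=0):
    """Translate dna starting at offset, ignoring any trailing partial codon."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def reverse_complement(dna):
    """Return the complementary strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Hypothetical 12-nucleotide segment, chosen only for illustration.
segment = "ATGGCTAAAGGT"

primary = translate(segment)                         # same strand, frame 0
shifted = translate(segment, offset=1)               # same strand, shifted by one
opposite = translate(reverse_complement(segment))    # complementary strand

print(primary, shifted, opposite)
```

Each reading yields a different peptide from the same nucleotides, which is why a change that improves one encoded protein generally alters the other. A design method therefore has to score both (or all three) translations jointly.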