El Soufi Karim, Michel Christian J
Theoretical Bioinformatics, ICube, University of Strasbourg, CNRS, 300 Boulevard Sébastien Brant, 67400 Illkirch, France.
Biosystems. 2017 Mar-Apr;153-154:45-62. doi: 10.1016/j.biosystems.2017.02.001. Epub 2017 Feb 24.
A set X of 20 trinucleotides was identified in genes of bacteria, eukaryotes, plasmids and viruses, which has on average the highest occurrence in the reading frame compared to its two shifted frames (Michel, 2015; Arquès and Michel, 1996). This set X has an interesting mathematical property: it is a circular code (Arquès and Michel, 1996). Thus, the motifs from this circular code X, called X motifs, always retrieve, synchronize and maintain the reading frame in genes. The origin of this circular code X in genes has been an open problem since its discovery in 1996. Here, we first show that the unitary circular codes (UCC), i.e. sets of one word, generate unitary circular code motifs (UCC motifs), i.e. concatenations of the same word (simple repeats), leading to low complexity DNA. Three classes of UCC motifs are studied here: repeated dinucleotides (D motifs), repeated trinucleotides (trinucleotide T motifs) and repeated tetranucleotides (tetranucleotide T motifs). The dinucleotide, trinucleotide and tetranucleotide motifs retrieve, synchronize and maintain a frame modulo 2, modulo 3 and modulo 4, respectively, as well as their shifted frames (1 modulo 2; 1 and 2 modulo 3; 1, 2 and 3 modulo 4, according to the C property of each class) in the DNA sequences. The statistical distribution of the three classes of motifs is analyzed in the genomes of eukaryotes. A UCC motif and its complementary UCC motif have the same distribution in the eukaryotic genomes. Furthermore, the occurrences of a UCC motif and its complementary UCC motif increase as their number of hydrogen bonds decreases, an effect that is very significant with the T motifs. The longest motifs of each class in the studied eukaryotic genomes are also given. Surprisingly, a scarcity of repeated trinucleotides is observed in the large eukaryotic genomes compared to the repeated dinucleotides and repeated tetranucleotides. This result has been investigated and may be explained by two outcomes.
First, repeated trinucleotides (T motifs) are identified in the X motifs of low composition (cardinality less than 10) in the genomes of eukaryotes. Second, identical trinucleotide pairs of the circular code X are preferentially used in the gene sequences of eukaryotes. These two results suggest that the unitary circular codes of trinucleotides may have been involved in the formation of the trinucleotide circular code X. Indeed, repeated trinucleotides in the X motifs in the genomes of eukaryotes may represent an intermediate evolutionary step from repeated trinucleotides of cardinality 1 (T motifs) in the genomes of eukaryotes up to the X motifs of cardinality 20 in the gene sequences of eukaryotes.
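As an illustration of what a UCC motif is, the following minimal Python sketch searches a DNA string for the longest simple repeat of a single k-letter word (k = 2, 3 or 4) and reports the frame it starts in. It is not the authors' method. The function names `is_primitive` and `longest_ucc_motif` are hypothetical, and the sketch assumes the usual characterization that a one-word set is a circular code exactly when the word is not a repetition of a shorter word (so AA or ACAC are excluded, while AC or CAG qualify).

```python
def is_primitive(word):
    """True if `word` is not a repetition of a shorter word.

    Assumption: a one-word set {word} is treated as a unitary circular
    code exactly when this test passes.
    """
    # A word occurs inside word+word at an offset smaller than its own
    # length only when it is a power of a shorter word.
    return (word + word).find(word, 1) == len(word)


def longest_ucc_motif(seq, k):
    """Return (unit, start, repeats) for the longest tandem repeat of a
    primitive k-letter unit in `seq`. Ties keep the leftmost hit.
    A value of repeats == 1 means no repeat was found for that unit."""
    best = ("", -1, 0)
    for i in range(len(seq) - k + 1):
        unit = seq[i:i + k]
        if not is_primitive(unit):
            continue
        # Extend the run while the next k letters equal the unit.
        j = i + k
        while seq[j:j + k] == unit:
            j += k
        repeats = (j - i) // k
        if repeats > best[2]:
            best = (unit, i, repeats)
    return best


if __name__ == "__main__":
    unit, start, repeats = longest_ucc_motif("TTCAGCAGCAGTT", 3)
    # The start position modulo k gives the frame in which the motif lies.
    print(unit, start, repeats, "frame", start % 3)
```

Once the unit and its start position are known, every later position in the run is fixed modulo k, which is the sense in which such a motif maintains a frame modulo 2, 3 or 4.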