Arquès D G, Michel C J
Université de Marne-la-Vallée, Institut Gaspard Monge, France.
J Theor Biol. 1996 Sep 7;182(1):45-58. doi: 10.1006/jtbi.1996.0142.
Recently, shifted periodicities 1 modulo 3 and 2 modulo 3 were identified in protein (coding) genes of both prokaryotes and eukaryotes with autocorrelation functions analysing eight of the 64 trinucleotides (Arquès et al., 1995). This observation suggests that the trinucleotides are associated with frames in protein genes. To verify this hypothesis, the distribution of the 64 trinucleotides AAA, ..., TTT is studied in both gene populations using a simple method based on the trinucleotide frequencies per frame. In protein genes, the trinucleotides can be read in three frames: the reading frame 0 established by the ATG start trinucleotide, and frame 1 (resp. 2), which is frame 0 shifted by 1 (resp. 2) nucleotides in the 5'-3' direction. The occurrence frequencies of the 64 trinucleotides are then computed in the three frames. By classifying each of the 64 trinucleotides in its preferential occurrence frame, i.e. the frame associated with its highest frequency, three subsets of trinucleotides can be identified in the three frames. This approach is applied to the two gene populations. Unexpectedly, the same three subsets of trinucleotides are identified in both gene populations: T0 = X0 ∪ {AAA,TTT} with X0 = {AAC,AAT,ACC,ATC,ATT,CAG,CTC,CTG,GAA,GAC,GAG,GAT,GCC,GGC,GGT,GTA,GTC,GTT,TAC,TTC} in frame 0, T1 = X1 ∪ {CCC} in frame 1 and T2 = X2 ∪ {GGG} in frame 2, each subset X0, X1 and X2 having 20 trinucleotides. Surprisingly, these three subsets have five important properties: (i) the property of maximal circular code for X0 (resp. X1, X2), allowing the automatic retrieval of frame 0 (resp. 1, 2) in any region of a protein gene model (formed by a series of trinucleotides of X0) without using a start codon; (ii) the DNA complementarity property C (e.g. C(AAC) = GTT): C(T0) = T0, C(T1) = T2 and C(T2) = T1, allowing the two paired reading frames of a DNA double helix to code for amino acids simultaneously; (iii) the circular permutation property P (e.g. P(AAC) = ACA): P(X0) = X1 and P(X1) = X2, implying that the two subsets X1 and X2 can be deduced from X0; (iv) the rarity property, with an occurrence probability of X0 equal to 6 × 10^-8; and (v) the concatenation property, with a high frequency (27.5%) of misplaced trinucleotides in the shifted frames, a maximum length (13 nucleotides) of the minimal window needed to retrieve the frame automatically, and an occurrence of the four types of nucleotides in the three trinucleotide sites, in favour of an evolutionary code. In the Discussion, the identified subsets T0, T1 and T2, expressed in the three two-letter genetic alphabets purine/pyrimidine, amino/keto and strong/weak interaction, allow us to deduce that the RNY model (R = purine = A or G, Y = pyrimidine = C or T, N = R or Y) (Eigen & Schuster, 1978) is the closest two-letter codon model to the trinucleotides of T0. These three subsets are then related to the genetic code. The trinucleotides of T0 code for 13 amino acids: Ala, Asn, Asp, Gln, Glu, Gly, Ile, Leu, Lys, Phe, Thr, Tyr and Val. Finally, a strong correlation is observed between the usage of the trinucleotides of T0 in protein genes and the amino acid frequencies in proteins: six of the seven amino acids not coded by T0 have, as expected, the lowest frequencies in proteins of both prokaryotes and eukaryotes.
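The frame-based classification described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the two toy sequences and the tie-breaking rule (ties go to the lowest-numbered frame) are assumptions made for illustration, and sequences are assumed to be uppercase A/C/G/T strings starting at their ATG.

```python
from collections import Counter
from itertools import product

# The 64 trinucleotides AAA, ..., TTT.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def frame_frequencies(genes):
    """Return {frame: {trinucleotide: relative frequency}} for frames 0, 1 and 2."""
    freqs = {}
    for frame in range(3):
        counts = Counter()
        for gene in genes:
            # Read non-overlapping trinucleotides starting at offset `frame`.
            for i in range(frame, len(gene) - 2, 3):
                counts[gene[i:i + 3]] += 1
        total = sum(counts.values())
        freqs[frame] = {
            t: (counts[t] / total if total else 0.0) for t in TRINUCLEOTIDES
        }
    return freqs


def preferential_frames(freqs):
    """Assign each observed trinucleotide to the frame where its frequency is highest.

    Assumption: a tie is resolved in favour of the lowest-numbered frame;
    trinucleotides that never occur are left unclassified.
    """
    subsets = {0: set(), 1: set(), 2: set()}
    for t in TRINUCLEOTIDES:
        per_frame = [freqs[f][t] for f in range(3)]
        if max(per_frame) > 0:
            subsets[per_frame.index(max(per_frame))].add(t)
    return subsets


# Hypothetical toy sequences, far too short to reproduce the published subsets.
toy_genes = ["ATGGAAGACGTTAACTAA", "ATGGGCATCCTGGAGTAA"]
subsets = preferential_frames(frame_frequencies(toy_genes))
```

On real data the input would be a large population of protein genes; the resulting three subsets would then be compared between the prokaryotic and eukaryotic populations.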
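Properties (ii) and (iii) and the amino acid count can be checked directly from the 20-trinucleotide set listed in the abstract for frame 0 (named X0 in the code). The following Python sketch is an illustration, not part of the original paper: the helper `image` and the variable names are assumptions, C is taken to be the reverse complement (consistent with C(AAC) = GTT), P is the left circular shift (consistent with P(AAC) = ACA), and the standard genetic code is assumed for translation.

```python
# The 20 trinucleotides of the frame-0 set given in the abstract.
X0 = set(
    "AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
    "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split()
)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def C(t):
    """Complementarity map: reverse complement, e.g. C('AAC') == 'GTT'."""
    return t.translate(_COMPLEMENT)[::-1]


def P(t):
    """Circular permutation: left shift by one letter, e.g. P('AAC') == 'ACA'."""
    return t[1:] + t[0]


def image(f, trinucleotides):
    """Apply a map to every trinucleotide of a set (hypothetical helper)."""
    return {f(t) for t in trinucleotides}


# X1 and X2 deduced from X0 by circular permutation (property iii).
X1 = image(P, X0)
X2 = image(P, X1)

T0 = X0 | {"AAA", "TTT"}
T1 = X1 | {"CCC"}
T2 = X2 | {"GGG"}

# Complementarity (property ii).
assert image(C, T0) == T0
assert image(C, T1) == T2
assert image(C, T2) == T1

# Standard genetic code, bases ordered T, C, A, G ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

coded = {CODON_TABLE[t] for t in T0}        # amino acids coded by T0
not_coded = set(AMINO) - coded - {"*"}      # amino acids with no codon in T0

print(len(coded), "".join(sorted(coded)))   # 13 ADEFGIKLNQTVY
print("".join(sorted(not_coded)))           # CHMPRSW
```

In one-letter notation the 13 coded amino acids are A, D, E, F, G, I, K, L, N, Q, T, V and Y, matching the list in the abstract, and the seven amino acids without a codon in T0 are Cys, His, Met, Pro, Arg, Ser and Trp.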