Gorban A N, Zinovyev A Y
University of Leicester, Leicester, UK.
Bull Math Biol. 2007 Oct;69(7):2429-42. doi: 10.1007/s11538-007-9229-6. Epub 2007 Jun 19.
In special coordinates (codon position-specific nucleotide frequencies), bacterial genomes form two straight lines in 9-dimensional space: one line for eubacterial genomes and another for archaeal genomes. All 348 distinct bacterial genomes available in GenBank in April 2007 belong to these lines with high accuracy. The main challenge now is to explain the observed high accuracy. A new phenomenon of complementary symmetry for codon position-specific nucleotide frequencies is observed. The results of an analysis of several codon usage models are presented. We demonstrate that the mean-field approximation, also known as the context-free model, the complete independence model, or the Segre variety, can serve as a reasonable approximation to real codon usage. The first two principal components of codon usage correlate strongly with genomic G+C content and the optimal growth temperature, respectively. The variation of codon usage along the third component is related to the curvature of the mean-field approximation. The first three eigenvalues in the codon usage PCA explain 59.1%, 7.8%, and 4.7% of the variation, respectively. The codon usage of eubacterial and archaeal genomes is clearly distributed along two third-order curves, with genomic G+C content as a parameter.
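To make the terms in the abstract concrete, the following minimal Python sketch illustrates what "codon position-specific nucleotide frequencies" and the "mean-field" (complete independence) approximation could look like in practice. It is an illustration under stated assumptions, not the authors' code: the function names `position_frequencies` and `mean_field` are hypothetical, and codon usage is assumed to be given as a dictionary mapping three-letter codons to frequencies that sum to 1. Each of the three codon positions carries four nucleotide frequencies that sum to 1, so only 3 × 3 = 9 of the 12 values are independent, which matches the 9-dimensional space mentioned in the abstract.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def position_frequencies(codon_freq):
    """Return the marginal nucleotide frequencies at codon positions 1, 2, 3.

    codon_freq: dict mapping codons such as "ATG" to frequencies summing to 1
    (an assumed input format, chosen for illustration). Codons missing from
    the dict are treated as having frequency 0.
    """
    pos = [{n: 0.0 for n in NUCLEOTIDES} for _ in range(3)]
    for codon, freq in codon_freq.items():
        for k, nucleotide in enumerate(codon):
            pos[k][nucleotide] += freq
    return pos


def mean_field(pos):
    """Approximate each codon frequency by the product of its three
    position-specific nucleotide frequencies (complete independence)."""
    return {
        "".join(c): pos[0][c[0]] * pos[1][c[1]] * pos[2][c[2]]
        for c in product(NUCLEOTIDES, repeat=3)
    }


# Toy usage with made-up numbers (not data from the paper).
toy_usage = {"AAA": 0.5, "CCC": 0.5}
toy_positions = position_frequencies(toy_usage)
toy_mean_field = mean_field(toy_positions)
```

In this toy case the real usage has only two codons, while the mean-field approximation spreads the weight over all eight combinations of A and C. The gap between the two shows the kind of deviation from independence that a codon usage model has to account for. A distribution that is already a product of position-specific frequencies is reproduced exactly by `mean_field(position_frequencies(...))`.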