Aliperti Car Lucio, Sánchez Ignacio E
Instituto de Química Biológica de La Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Facultad de Ciencias Exactas y Naturales, Laboratorio de Fisiología de Proteínas, Universidad de Buenos Aires, Consejo Nacional de Investigaciones Científicas y Técnicas, Buenos Aires, Argentina.
J Mol Evol. 2025 May 20. doi: 10.1007/s00239-025-10251-x.
Encoding of protein-coding sequences in a genome through evolution leads to characteristic proportions of codons and amino acids. Here, we present a simplified maximum entropy model that groups together codons that have the same GC (guanine + cytosine) content and code for the same amino acid, and that accounts for the stoichiometry of genetic elements in over 50,000 genomes with seven interpretable parameters. Our model includes both the cost of a codon given a genomic GC content and the metabolic cost of the corresponding amino acid. Both costs are essential for accurate prediction of codon and amino acid abundances. The best implementation of the model includes a universal equilibrium value for the genomic GC content that lies below 50%, as suggested by the literature. It also splits the twenty amino acids into two groups, forming strong (bases C and G) or weak (bases A and U) Watson-Crick base pairs with the anticodon, which differ in the strength of GC-dependent selection. The entropy-cost trade-off suggests that each organism has sorted out the genome encoding problem given a value for its genomic GC content. The empirical boundaries to this trade-off suggest minimal values for the amino acid and codon entropies, which may limit the GC content of natural genomes.
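The grouping step described in the abstract (codons that code for the same amino acid and have the same GC content are pooled) can be illustrated with a minimal Python sketch. It assumes the standard genetic code, writes codons in the DNA alphabet (T in place of U), and leaves out stop codons; these choices, and the name `codon_groups`, are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (first base varies slowest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)


def codon_groups():
    """Map (amino acid, number of G/C bases) to the list of codons in that group.

    Assumption for illustration: stop codons are excluded from the grouping.
    """
    groups = defaultdict(list)
    for codon_bases, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS):
        if amino_acid == "*":
            continue  # skip stop codons (assumption, see docstring)
        codon = "".join(codon_bases)
        gc_count = sum(base in "GC" for base in codon)
        groups[(amino_acid, gc_count)].append(codon)
    return dict(groups)


if __name__ == "__main__":
    for (amino_acid, gc_count), codons in sorted(codon_groups().items()):
        print(amino_acid, gc_count, codons)
```

In this sketch, lysine splits into two single-codon groups (AAA with no G or C, AAG with one), while glycine splits into one group with two G/C bases (GGT, GGA) and one with three (GGC, GGG).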