School of Computing, University of Kent, Canterbury CT2 7NF, UK.
Kent Fungal Group, School of Biosciences, University of Kent, Canterbury CT2 7NJ, UK.
J R Soc Interface. 2020 Feb;17(163):20190819. doi: 10.1098/rsif.2019.0819. Epub 2020 Feb 19.
The genetic code is necessarily degenerate, with 64 possible nucleotide triplets being translated into 20 amino acids. Eighteen of the 20 amino acids are encoded by multiple synonymous codons. While synonymous codons are equivalent in terms of the information they carry, it is now well established that they are used in a biased fashion. There is currently no consensus as to the origin of this bias. Drawing on ideas from stochastic thermodynamics, we derive from first principles a mathematical model describing the statistics of codon usage bias. We show that the model accurately describes the distribution of codon usage bias of genomes in the fungal and bacterial kingdoms. Based on it, we derive a new computational measure of codon usage bias, a distance that captures two aspects of codon usage bias: (i) differences in the genome-wide frequency of codons and (ii) apparent non-random distributions of codons across mRNAs. By means of large-scale computational analysis of over 900 species across two kingdoms of life, we demonstrate that our measure provides novel biological insights. Specifically, we show that while codon usage bias is clearly based on heritable traits and closely related species show similar degrees of bias, there is considerable variation in the magnitude of this distance within taxonomic classes, suggesting that the contribution of sequence-level selection to codon bias varies substantially within relatively confined taxonomic groups. Interestingly, commonly used model organisms are near the median of the distance values for their taxonomic class, suggesting that they may not be good representative models for species with more extreme values, which comprise organisms of medical and agricultural interest. We also demonstrate that amino-acid-specific patterns of codon usage are themselves quite variable between branches of the tree of life, and that some of this variability correlates with organismal tRNA content.
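To make aspect (i) concrete, the following is a minimal Python sketch of how genome-wide synonymous codon frequencies could be tabulated from a set of coding sequences. It is not the distance measure derived in the paper, whose definition is not given in this abstract; the function name `synonymous_codon_frequencies` and the toy input are assumptions made purely for illustration, and the standard genetic code is assumed throughout.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in the conventional TCAG ordering ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Sanity check against the abstract: 18 of the 20 amino acids have
# more than one synonymous codon (all except Met and Trp).
codons_per_aa = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
N_DEGENERATE = sum(1 for n in codons_per_aa.values() if n > 1)


def synonymous_codon_frequencies(cds_list):
    """Hypothetical helper: pooled frequency of each codon among its synonyms.

    cds_list is an iterable of in-frame coding sequences (DNA or RNA).
    Codons containing ambiguous bases are skipped.
    """
    counts = Counter()
    for cds in cds_list:
        seq = cds.upper().replace("U", "T")
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:
                counts[codon] += 1

    # Total occurrences of each amino acid (or stop) across all sequences.
    totals = defaultdict(int)
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n

    # Normalise each codon count by the total for its amino acid.
    return {codon: n / totals[CODON_TABLE[codon]] for codon, n in counts.items()}


if __name__ == "__main__":
    # Toy input: Met, Ala, Ala, Ala, Lys with two different alanine codons.
    freqs = synonymous_codon_frequencies(["ATGGCTGCTGCCAAA"])
    for codon in sorted(freqs):
        print(codon, CODON_TABLE[codon], round(freqs[codon], 3))
```

Under unbiased usage, each codon of an amino acid with k synonyms would appear with frequency close to 1/k; deviations from that baseline are the genome-wide component of bias. Aspect (ii), the distribution of codons across individual mRNAs, would require per-sequence rather than pooled counts and is not covered by this sketch.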