Division of Electronics, Rudjer Boskovic Institute, Zagreb, Croatia.
PLoS Genet. 2010 Jun 24;6(6):e1001004. doi: 10.1371/journal.pgen.1001004.
Codon usage bias in prokaryotic genomes is largely a consequence of background substitution patterns in DNA, but highly expressed genes may show a preference towards codons that enable more efficient and/or accurate translation. We introduce a novel approach based on supervised machine learning that detects effects of translational selection on genes, while controlling for local variation in nucleotide substitution patterns represented as sequence composition of intergenic DNA. A cornerstone of our method is a Random Forest classifier that outperformed previous distance measure-based approaches, such as the codon adaptation index, in the task of discerning the (highly expressed) ribosomal protein genes by their codon frequencies. Unlike previous reports, we show evidence that translational selection in prokaryotes is practically universal: in 460 of 461 examined microbial genomes, we find that a subset of genes shows a higher codon usage similarity to the ribosomal proteins than would be expected from the local sequence composition. These genes constitute a substantial part of the genome (between 5% and 33%, depending on genome size), while also exhibiting higher experimentally measured mRNA abundances and tending toward codons that match tRNA anticodons by canonical base pairing. Certain gene functional categories are generally enriched with, or depleted of, codon-optimized genes, the trends of enrichment/depletion being conserved between Archaea and Bacteria. Prominent exceptions to these trends might indicate genes with alternative physiological roles; we speculate on specific examples related to detoxification of oxygen radicals and ammonia and to possible misannotations of asparaginyl-tRNA synthetases. Since the presence of codon optimization in genes is a valid proxy for expression levels in fully sequenced genomes, we provide an example of an "adaptome" by highlighting gene functions with expression levels elevated specifically in thermophilic Bacteria and Archaea.
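For orientation, the distance measure-based baseline named above, the codon adaptation index, can be sketched in a few lines. This is a minimal illustration of the commonly used formulation (geometric mean of each codon's frequency relative to the most frequent synonymous codon in a reference set such as the ribosomal protein genes), not the Random Forest method of this work. The function names, the `pseudocount` smoothing, and the toy sequences are assumptions made for the example.

```python
from collections import Counter
from itertools import product
from math import exp, log

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codons(seq):
    """Count in-frame codons of a coding sequence, skipping unknown triplets."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return Counter(c for c in codons if c in CODON_TABLE)


def relative_adaptiveness(reference_genes, pseudocount=0.5):
    """Weight of each codon relative to the most used synonymous codon.

    The pseudocount is an assumption of this sketch; it keeps codons that are
    absent from the reference set from receiving a weight of zero.
    """
    counts = Counter()
    for gene in reference_genes:
        counts.update(count_codons(gene))
    weights = {}
    # Stop codons and the single-codon amino acids (Met, Trp) carry no signal.
    for aa in set(AMINO) - {"*", "M", "W"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        best = max(counts[c] + pseudocount for c in family)
        for c in family:
            weights[c] = (counts[c] + pseudocount) / best
    return weights


def cai(gene, weights):
    """Geometric mean of the codon weights over the informative codons of a gene."""
    counts = count_codons(gene)
    n = sum(k for c, k in counts.items() if c in weights)
    if n == 0:
        raise ValueError("gene contains no informative codons")
    log_sum = sum(k * log(weights[c]) for c, k in counts.items() if c in weights)
    return exp(log_sum / n)


# Toy reference set standing in for ribosomal protein genes (hypothetical data).
reference = ["CTGCTGCTGAAAAAAGAA"]
w = relative_adaptiveness(reference)
optimal_score = cai("CTGAAAGAA", w)  # uses the reference-preferred codons
rare_score = cai("CTAAAGGAG", w)     # same peptide, non-preferred codons
```

A gene built from the reference-preferred codons scores higher than a synonymous gene built from rarely used ones. Unlike the approach described in the abstract, a score of this kind does not control for the local nucleotide composition of the surrounding intergenic DNA.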