Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA.
Broad Institute of MIT and Harvard, Cambridge, MA, USA.
Mol Biol Evol. 2019 Oct 1;36(10):2328-2339. doi: 10.1093/molbev/msz124.
Because of the degeneracy of the genetic code, multiple codons are translated into the same amino acid. Despite being "synonymous," these codons are not used equally. Selective pressures are thought to drive the choice among synonymous codons within a genome, while GC content, which is typically attributed to mutational drift, is the major determinant of variation across species. Here, we find that in addition to GC content, interspecies codon usage signatures can also be detected. More specifically, we show that a single amino acid, arginine, is the major contributor to codon usage bias differences across domains of life. We then exploit this finding and show that domain-specific codon bias signatures can be used to classify a given sequence into its corresponding domain of life with high accuracy. We then asked whether including codon autocorrelation patterns, which reflect the nonrandom distribution of codon occurrences throughout a transcript, might improve the classification performance of our algorithm. However, we find that autocorrelation patterns are not domain-specific and, surprisingly, are unrelated to tRNA reusage, in contrast to previous reports. Instead, our results suggest that codon autocorrelation patterns are a by-product of codon optimality throughout a sequence, where highly expressed genes display autocorrelated "optimal" codons, whereas lowly expressed genes display autocorrelated "nonoptimal" codons.
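To make the notion of a per-amino-acid codon usage signature concrete, the following is a minimal sketch of how the relative usage of the six arginine codons might be tabulated for one coding sequence. It is an illustration only, not the pipeline used in the study: the function name `arginine_codon_fractions` and the toy sequence are hypothetical, and the sketch assumes the standard genetic code and an in-frame coding sequence.

```python
from collections import Counter

# The six arginine codons of the standard genetic code (DNA alphabet).
ARG_CODONS = ("CGT", "CGC", "CGA", "CGG", "AGA", "AGG")


def arginine_codon_fractions(cds: str) -> dict[str, float]:
    """Return, for each arginine codon, its share of all arginine codons in cds.

    Assumes cds is an in-frame coding sequence (hypothetical helper, not
    taken from the paper).
    """
    seq = cds.upper().replace("U", "T")  # accept RNA or DNA input
    # Split into in-frame triplets, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in ARG_CODONS)
    total = sum(counts.values())
    if total == 0:
        # No arginine in the sequence: report zero usage for every codon.
        return {c: 0.0 for c in ARG_CODONS}
    return {c: counts[c] / total for c in ARG_CODONS}


# Toy sequence (made up for illustration): ATG CGT CGT AGA AGG CGC TAA
example = "ATGCGTCGTAGAAGGCGCTAA"
fractions = arginine_codon_fractions(example)
for codon in ARG_CODONS:
    print(codon, round(fractions[codon], 2))
```

A vector of such fractions, computed per sequence, is the kind of feature a classifier could take as input. The choice of features and classifier here is an assumption and is not specified by the abstract.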