Carol Yu Centre for Infection, Department of Microbiology, The University of Hong Kong, Hong Kong, China.
BMC Genomics. 2010 Sep 9;11:491. doi: 10.1186/1471-2164-11-491.
Out-of-frame stop codons (OSCs) occur naturally in the coding sequences of all organisms. They provide a mechanism for early termination of translation in an incorrect reading frame, which reduces the metabolic cost associated with frameshift events. Given this functional significance, we expect OSCs to be statistically overrepresented in coding sequences as a result of widespread selection. Accordingly, we examined available prokaryotic genomes for evidence of this selection.
The complete genome sequences of 990 prokaryotes were obtained from NCBI GenBank. We found that coding sequences with low G+C content contain significantly more OSCs, and that G+C content at specific codon positions was the principal determinant of OSC usage bias in the different reading frames. To investigate whether OSCs are overrepresented, we modeled the trinucleotide and hexanucleotide biases of the coding sequences using Markov models, and calculated the expected OSC frequencies for each organism using a Monte Carlo approach. More than 93% of 342 phylogenetically representative prokaryotic genomes contain excess OSCs. Interestingly, the degree of OSC overrepresentation correlates positively with G+C content, which may represent a compensatory mechanism for the negative correlation of OSC frequency with G+C content. We extended the analysis using additional compositional bias models and showed that lower-order biases such as codon usage and dipeptide bias could not explain the OSC overrepresentation. After correcting for the G+C% and AT skew of the coding sequence, the degree of OSC overrepresentation was found to correlate negatively with the optimal growth temperature of the organism.
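The comparison of observed against expected OSC counts can be illustrated with a minimal Python sketch. It is not the procedure used in the study: in place of the trinucleotide and hexanucleotide Markov models, it substitutes a much simpler codon-shuffling null, which preserves only the codon usage of a single sequence. The function names `count_oscs` and `codon_shuffle_null`, the example sequence, and the choice of the standard stop codons TAA, TAG and TGA are assumptions made for illustration.

```python
import random

# Assumption: the standard stop codons (TAA, TAG, TGA).
STOPS = {"TAA", "TAG", "TGA"}


def count_oscs(cds: str) -> dict:
    """Count stop codons in the two shifted frames of an in-frame coding sequence.

    Key 1 is the frame starting one nucleotide downstream of the coding
    frame, key 2 the frame starting two nucleotides downstream.
    """
    cds = cds.upper()
    counts = {1: 0, 2: 0}
    for shift in (1, 2):
        # Step through complete triplets only.
        for i in range(shift, len(cds) - 2, 3):
            if cds[i:i + 3] in STOPS:
                counts[shift] += 1
    return counts


def codon_shuffle_null(cds: str, n_sim: int = 1000, seed: int = 0):
    """Monte Carlo estimate of the expected OSC count under codon shuffling.

    Hypothetical, simplified null model: codons are permuted within the
    sequence, so codon usage is preserved but codon order is randomized.
    Returns (observed count, mean simulated count, empirical one-sided p).
    """
    rng = random.Random(seed)
    cds = cds.upper()
    cds = cds[: len(cds) - len(cds) % 3]  # drop any trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    observed = sum(count_oscs(cds).values())

    simulated = []
    for _ in range(n_sim):
        rng.shuffle(codons)
        simulated.append(sum(count_oscs("".join(codons)).values()))

    mean_sim = sum(simulated) / n_sim
    # Fraction of simulations with at least as many OSCs as observed,
    # with a +1 correction so the estimate is never exactly zero.
    p_value = (1 + sum(s >= observed for s in simulated)) / (n_sim + 1)
    return observed, mean_sim, p_value


if __name__ == "__main__":
    example = "ATGATAGCTGAA"  # made-up sequence, not real data
    print(count_oscs(example))
    print(codon_shuffle_null(example, n_sim=200))
```

An observed count well above the simulated mean, with a small empirical p-value, would indicate excess OSCs relative to this null. Because the shuffle preserves only codon usage, it corresponds to one of the lower-order bias models rather than to the higher-order Markov models described above.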
The present study uses statistically rigorous approaches to show that OSC overrepresentation is a widespread phenomenon among prokaryotes. Our results support the hypothesis that OSCs carry functional significance and have been selected in the course of genome evolution to counteract unintended frameshift events. Some results also hint that OSC overrepresentation may be a compensatory mechanism that makes up for the decrease in OSCs in high G+C organisms, revealing an interplay between two different determinants of OSC frequency.