State Key Laboratory of Palaeobiology and Stratigraphy, Nanjing Institute of Geology and Palaeontology, Chinese Academy of Sciences, Nanjing, China.
Mol Biol Evol. 2013 Jan;30(1):191-6. doi: 10.1093/molbev/mss201. Epub 2012 Aug 21.
The effective number of codons (N(c)) is a widely used index for characterizing codon usage bias because it does not require a set of reference genes, as the codon adaptation index (CAI) does, and because computational tools such as CodonW are freely available. However, N(c) as originally formulated has many problems. For example, it can take values far greater than the number of sense codons; it treats a 6-fold compound codon family as a single codon family, although it is made up of a 2-fold and a 4-fold codon family that can be under dramatically different selection for codon usage bias; the existing implementations do not handle all the different genetic codes; and it is often biased by codon families with a small number of codons. We developed a new N(c) that has a number of advantages over the original N(c). Its maximum value equals the number of sense codons when all synonymous codons are used equally, and its minimum value equals the number of codon families when exactly one codon is used in each synonymous codon family. It handles all known genetic codes. It breaks compound codon families (e.g., those involving amino acids coded by six synonymous codons) into 2-fold and 4-fold codon families. It reduces the effect of codon families with few codons by introducing pseudocounts and weighted averages. The new N(c) correlates significantly better with CAI than does the original N(c) from CodonW, based on protein-coding genes from Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Escherichia coli, Bacillus subtilis, Micrococcus luteus, and Mycoplasma genitalium. It also correlates better with yeast protein abundance data than the original N(c) does.
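The ideas summarized in the abstract (splitting compound families, pseudocounts, weighted averages) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not state them: only the standard genetic code is covered, families are formed by grouping synonymous codons that share their first two nucleotides, a pseudocount of 1 is added to every codon, the homozygosity of a family is taken as F = Σ((n_i + pseudocount)/(n + m·pseudocount))², and each degeneracy class contributes K·Σn_j / Σ(n_j·F_j), where K is the number of families in the class. The function and variable names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in TCAG order (assumption: other codes are not handled here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_families(code=CODE):
    """Group sense codons by (amino acid, first two nucleotides).

    Under the standard code this splits each 6-fold amino acid (Leu, Ser, Arg)
    into one 2-fold and one 4-fold family, giving 23 families in total.
    """
    families = defaultdict(list)
    for codon, aa in code.items():
        if aa != "*":  # skip stop codons
            families[(aa, codon[:2])].append(codon)
    return list(families.values())


def effective_number_of_codons(seq, pseudocount=1.0):
    """Sketch of a pseudocount-based, weighted N(c) for one coding sequence."""
    seq = seq.upper().replace("U", "T")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))

    # Collect families by degeneracy class (1-, 2-, 3-, 4-fold).
    by_size = defaultdict(list)
    for fam in codon_families():
        by_size[len(fam)].append(fam)

    nc = 0.0
    for size, fams in by_size.items():
        if size == 1:  # Met and Trp each contribute exactly 1
            nc += len(fams)
            continue
        total = weighted_f = 0.0
        for fam in fams:
            n = sum(counts[c] for c in fam)
            # Homozygosity with a pseudocount on every codon (assumed form).
            f = sum(((counts[c] + pseudocount) / (n + pseudocount * size)) ** 2
                    for c in fam)
            total += n
            weighted_f += n * f  # weight each family by its codon count
        # If a class is never observed, assume no bias for it (assumption).
        nc += len(fams) * (total / weighted_f if weighted_f > 0 else size)
    return nc
```

With these assumptions, a sequence that uses every sense codon equally gives 61 (the number of sense codons in the standard code), and a sequence that uses a single codon per family approaches 23 (the number of families) as the sequence grows, which matches the bounds described in the abstract.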