Doig A J
Department of Biomolecular Sciences, UMIST, Manchester, M60 1QD, U.K.
J Theor Biol. 1997 Oct 7;188(3):355-60. doi: 10.1006/jtbi.1997.0489.
The function of DNA is to specify protein sequences. The four-base "alphabet" used in nucleic acids is translated to the 20-letter amino acid alphabet of proteins (plus a stop signal) via the genetic code. The code is neither overlapping nor punctuated; mRNA sequences are read in successive triplet codons until a stop codon is reached. The natural genetic code thus uses three bases for every amino acid. The efficiency of the genetic code can be significantly increased if the requirement for a fixed codon length is dropped, so that the more common amino acids have shorter codons and rare amino acids have longer ones. More efficient codes can be derived using the Shannon-Fano and Huffman coding algorithms. Among codes that assign each amino acid its own codon, the compression achieved by a Huffman code cannot be improved upon. I have used these algorithms to derive efficient codes for representing protein sequences using both two and four bases. The length of DNA required to specify the complete set of protein sequences could be significantly shorter if translation used a variable codon length. The restriction to a fixed codon length of three bases means that it takes 42% more DNA than the minimum necessary, so the genetic code is 70% efficient. One can think of many reasons why this maximally efficient code has not evolved. There is very little redundancy, so almost any mutation causes an amino acid change. Many mutations would be potentially lethal frame-shift mutations, since a mutation that changes the codon length shifts the reading frame. It would also be more difficult for the translation machinery to cope with a variable codon length. Nevertheless, in the strict and narrow sense of coding for protein sequences using the minimum length of DNA possible, the Huffman code derived here is perfect.
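The idea of a variable-length codon table can be illustrated with a minimal sketch of radix-4 Huffman coding, where each codeword position is one of four bases. This is not the paper's code or data: the function names `huffman_code_lengths` and `average_length` are hypothetical, and the weights are toy values chosen for illustration, not measured amino acid frequencies. For reference, the 42% figure in the abstract implies a minimum of roughly 3/1.42 ≈ 2.1 bases per residue.

```python
import heapq
from itertools import count


def huffman_code_lengths(weights, radix=4):
    """Return {symbol: codeword length} for a radix-ary Huffman code.

    `weights` maps each symbol to a non-negative weight (count or frequency).
    radix=4 corresponds to codewords written with four bases, radix=2 to two.
    """
    if radix < 2 or len(weights) < 2:
        raise ValueError("need radix >= 2 and at least two symbols")
    tie = count()  # unique tie-breaker so heap entries never compare lists
    heap = [(w, next(tie), [sym]) for sym, w in weights.items()]
    # Pad with zero-weight dummy nodes so every merge combines exactly
    # `radix` nodes and the final merge produces a single root.
    while (len(heap) - 1) % (radix - 1) != 0:
        heap.append((0, next(tie), []))
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in weights}
    while len(heap) > 1:
        total, members = 0, []
        for _step in range(radix):
            w, _order, syms = heapq.heappop(heap)
            total += w
            members.extend(syms)
        # Every symbol under the merged node moves one level deeper.
        for sym in members:
            lengths[sym] += 1
        heapq.heappush(heap, (total, next(tie), members))
    return lengths


def average_length(weights, lengths):
    """Weighted mean codeword length (bases per symbol when radix=4)."""
    total = sum(weights.values())
    return sum(weights[s] * lengths[s] for s in weights) / total


# Toy alphabet (assumed weights): three common symbols, four rare ones.
toy = {"A": 4, "B": 4, "C": 4, "D": 1, "E": 1, "F": 1, "G": 1}
toy_lengths = huffman_code_lengths(toy, radix=4)
print(toy_lengths)                       # A, B, C get length 1; D..G get length 2
print(average_length(toy, toy_lengths))  # 1.25

# 21 equally likely symbols (20 amino acids plus stop), as a baseline only.
uniform = {i: 1 for i in range(21)}
uniform_lengths = huffman_code_lengths(uniform, radix=4)
print(round(average_length(uniform, uniform_lengths), 3))  # 2.333
```

Even with equal weights, the variable-length code needs about 2.33 bases per symbol rather than 3. Supplying real amino acid frequencies in place of the uniform weights would be the step that approaches the minimum the abstract refers to.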