Gusein-Zade S M
Institute of Molecular Genetics, USSR Academy of Sciences, Moscow.
J Biomol Struct Dyn. 1989 Apr;6(5):1001-12. doi: 10.1080/07391102.1989.10506527.
Information science widely describes the distribution of information units (words) by frequency of occurrence using a ranked series, i.e., the sequence of occurrence frequencies p_1, p_2, ..., p_r, ... taken in decreasing order. The most commonly used model is the Zipf rule, or Zipf law, in which p_r is inversely proportional to some power of the rank r: p_r = C/r^z (C, z > 0). Analysis shows that codon distributions fit the Zipf model unsatisfactorily. The frequency distribution of letters (in English and some other languages) does not obey the Zipf rule either. A new model is proposed for such distributions, in which p_r = C (ln(n + 1) - ln r), where n is the number of distinct symbols (codons). This dependence is approximated by a straight line not in the (ln r, ln p) coordinate system, as in the Zipf model, but in the (ln r, p) coordinate system. Statistical criteria show that this model is in good agreement with the ranked series of codon frequencies for the best-studied genomes to date. This result may be regarded as an additional argument in favor of the codon-letter analogy (rather than the codon-word analogy) in genetic texts.
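The difference between the two coordinate systems can be illustrated with a short sketch. It is not taken from the paper: the function names, the choice n = 64, and the use of R² as a linearity measure are assumptions made here for illustration, and the frequencies are generated from the model formulas rather than from real codon counts. The normalizing constant C is fixed by requiring the frequencies to sum to 1.

```python
import math


def zipf_frequencies(n, z):
    """Ranked frequencies p_r = C / r**z for r = 1..n, normalized to sum to 1."""
    weights = [1.0 / r ** z for r in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def log_model_frequencies(n):
    """Ranked frequencies p_r = C * (ln(n + 1) - ln r), normalized to sum to 1."""
    weights = [math.log(n + 1) - math.log(r) for r in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. R^2 of the least-squares line y = a + b*x."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


def linearity(freqs):
    """Measure how straight a ranked series is in each coordinate system."""
    log_r = [math.log(r) for r in range(1, len(freqs) + 1)]
    return {
        # Zipf coordinates: (ln r, ln p)
        "ln_r_vs_ln_p": r_squared(log_r, [math.log(p) for p in freqs]),
        # Coordinates of the proposed model: (ln r, p)
        "ln_r_vs_p": r_squared(log_r, freqs),
    }


if __name__ == "__main__":
    n = 64  # assumed number of distinct symbols (all possible codons)
    print("log model:", linearity(log_model_frequencies(n)))
    print("Zipf, z=1:", linearity(zipf_frequencies(n, 1.0)))
```

With real data, the ranked codon frequencies of a genome would replace the synthetic series passed to `linearity`, and a proper goodness-of-fit test would be needed to reproduce the statistical comparison described in the abstract.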