Zhang Chun-Ting, Zhang Ren
Department of Physics, Tianjin University, Tianjin 300072, China.
Comput Biol Chem. 2004 Apr;28(2):149-53. doi: 10.1016/j.compbiolchem.2004.02.002.
Let a, c, g and t denote the occurrence frequencies of A, C, G and T, respectively, in a genome. We calculated the statistical quantity S = a² + c² + g² + t² for each of 809 genomes (11 archaea, 42 bacteria, 3 eukaryota, 90 phages, 36 viroids and 627 viruses) and 236 plasmids. We found that S < 1/3 holds strictly for almost all of these genomes and plasmids. As direct deductions from this observation, it is shown that (i) S is a kind of genome order index, which is negatively correlated with the Shannon H function; (ii) S < 1/3 suggests that a minimal value of the Shannon H function is required for each genome; (iii) S as defined above would be a new biological statistical quantity, useful for describing the compositional features of genomes; (iv) when Chargaff Parity Rule 2 is considered jointly, the genomic G + C content should lie between 0.211 and 0.789.
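The quantities in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (base_frequencies, order_index, shannon_h, gc_bounds) are hypothetical, and the base-2 logarithm for H is an assumption, since the abstract does not state the base. For deduction (iv), the sketch assumes Chargaff Parity Rule 2 holds exactly (a = t and c = g). Writing x for the G + C content then gives a = t = (1 - x)/2 and c = g = x/2, so S = ((1 - x)² + x²)/2, and S < 1/3 reduces to x² - x + 1/6 < 0, whose roots are (1 ± sqrt(1/3))/2.

```python
import math
from collections import Counter


def base_frequencies(seq):
    """Return the frequencies (a, c, g, t); symbols other than A, C, G, T are ignored."""
    counts = Counter(ch for ch in seq.upper() if ch in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return tuple(counts[base] / total for base in "ACGT")


def order_index(freqs):
    """S = a^2 + c^2 + g^2 + t^2, as defined in the abstract."""
    return sum(p * p for p in freqs)


def shannon_h(freqs):
    """Shannon H in bits (base-2 logarithm is an assumption of this sketch)."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)


def gc_bounds(s_max=1 / 3):
    """G + C range implied by S < s_max, assuming a = t and c = g exactly.

    With x = G + C content, S = ((1 - x)^2 + x^2) / 2, so S < s_max
    holds for x strictly between (1 - sqrt(4*s_max - 1)) / 2 and
    (1 + sqrt(4*s_max - 1)) / 2.
    """
    if s_max < 0.25:
        raise ValueError("S cannot be smaller than 1/4")
    half_width = math.sqrt(4 * s_max - 1) / 2
    return 0.5 - half_width, 0.5 + half_width


if __name__ == "__main__":
    freqs = base_frequencies("ACGT")  # toy sequence, not real genome data
    print(order_index(freqs), shannon_h(freqs))
    low, high = gc_bounds()
    print(round(low, 3), round(high, 3))  # 0.211 0.789
```

For a uniform composition (a = c = g = t = 1/4), S takes its minimum value of 1/4 and H its maximum of 2 bits, which is consistent with the negative correlation between S and H noted in (i).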