McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA.
Biol Direct. 2010 Feb 17;5:10. doi: 10.1186/1745-6150-5-10.
The Z-curve is a three-dimensional representation of DNA sequences that was proposed over a decade ago and has been applied extensively to sequence segmentation, horizontal gene transfer detection, and sequence analysis. Based on the Z-curve, a "genome order index" was proposed, defined as $S = a^2 + c^2 + t^2 + g^2$, where a, c, t, and g are the frequencies of the nucleotides A, C, T, and G, respectively. This index was found to be smaller than 1/3 for almost all tested genomes, which was taken as support for the existence of a constraint on genome composition. A geometric explanation for this constraint has been suggested: each genome is represented by a point P whose distances from the four faces of a regular tetrahedron are given by the frequencies a, c, t, and g. The proponents of the index claimed that an inscribed sphere of radius $r = 1/\sqrt{3}$ contains almost all points corresponding to various genomes, implying that $S < r^2$. The distribution of the points P obtained by S was studied using the Z-curve.
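The definition of S above can be illustrated with a minimal Python sketch. The function name `genome_order_index` and the choice to ignore any character other than A, C, G, and T are assumptions made here for illustration; they are not taken from the paper.

```python
from collections import Counter


def genome_order_index(sequence: str) -> float:
    """Return S = a^2 + c^2 + t^2 + g^2 for a DNA sequence.

    Assumption: characters other than A, C, G, T (for example N) are
    skipped, so the four frequencies always sum to 1.
    """
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T characters")
    # Square each nucleotide frequency and add the four terms.
    return sum((counts[base] / total) ** 2 for base in "ACGT")


if __name__ == "__main__":
    # Toy sequences, not real genomes.
    for seq in ("ACGT", "AAAA", "AACCGT"):
        print(seq, genome_order_index(seq))
```

S reaches its minimum of 1/4 when the four nucleotides are equally frequent and its maximum of 1 when a single nucleotide makes up the whole sequence, so the reported bound S < 1/3 sits close to the lower end of the possible range.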
In this work, we studied the basic properties of the Z-curve using the "genome order index" as a case study. We show that (1) the calculation of the radius of the inscribed sphere of a regular tetrahedron is incorrect, (2) the S index is narrowly distributed, (3) based on the second parity rule, the S index can be derived directly from the Shannon entropy and is therefore redundant, and (4) the Z-curve suffers from over-dimensionality: the single dimension that stands for GC content suffices to represent any given genome.
The "genome order index" S does not represent a constraint on nucleotide composition. Moreover, S can be easily computed from the Gini-Simpson index and can be derived directly from entropy, and is therefore redundant. Overall, the Z-curve and S are over-complicated measures of GC content and the Shannon H index, respectively.
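The redundancy argument can be sketched numerically. The snippet below assumes the second parity rule holds exactly within a strand (a = t and c = g), so that all four frequencies follow from the GC content alone. It also uses the standard definition of the Gini-Simpson index as one minus the sum of squared frequencies. The function names are hypothetical and the GC values are arbitrary illustrations, not data from the paper.

```python
import math
from typing import Dict


def pr2_frequencies(gc: float) -> Dict[str, float]:
    """Nucleotide frequencies under an exact second parity rule (assumed)."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("GC content must lie in [0, 1]")
    return {"A": (1 - gc) / 2, "T": (1 - gc) / 2, "C": gc / 2, "G": gc / 2}


def order_index(freqs: Dict[str, float]) -> float:
    """S: sum of squared nucleotide frequencies."""
    return sum(p * p for p in freqs.values())


def gini_simpson(freqs: Dict[str, float]) -> float:
    """Gini-Simpson index, 1 minus the sum of squared frequencies."""
    return 1.0 - order_index(freqs)


def shannon_entropy(freqs: Dict[str, float]) -> float:
    """Shannon entropy H in bits; zero frequencies contribute nothing."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


if __name__ == "__main__":
    for gc in (0.25, 0.40, 0.50, 0.65):
        f = pr2_frequencies(gc)
        # Under the assumption, S reduces to gc^2 - gc + 1/2.
        closed_form = gc * gc - gc + 0.5
        print(gc, order_index(f), closed_form, gini_simpson(f), shannon_entropy(f))
```

Under this assumption both S and H are functions of GC content only, which is the sense in which one number carries the information of the other. The same closed form implies that S < 1/3 holds exactly when the GC content lies between roughly 0.21 and 0.79; this follows from the stated assumption and is not a figure quoted from the paper.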