Scales of statistical order in DNA.

Department of Chemistry, The Johns Hopkins University, Baltimore, MD 21218, USA.

Biophys Chem. 2009 May;141(2-3):203-13. doi: 10.1016/j.bpc.2009.02.003. Epub 2009 Feb 20.

In this paper we examine the statistics of occurrence of A-T and C-G base pairs in DNA, focusing on the net base composition in blocks of base pairs of various sizes. The paper extends our previous work on randomness and order in DNA sequences and examines order on several scales. On the local scale (10⁰-10¹ bp), we have seen that the net base composition in blocks of a given size is fitted very accurately by the discrete binomial distribution for a random system. If the statistics were random for larger block sizes, the appropriate distribution would be the standard normal (Gaussian) distribution, which is the continuous analog of the discrete binomial distribution. However, we have found that on the intermediate scale (10²-10⁴ bp) the composition distribution is not fitted by a standard normal distribution but rather by a modified normal distribution whose standard deviation is a markedly nonrandom function of block size. In particular, the standard deviation accurately follows a power law with a characteristic exponent. This behavior can be interpreted in terms of a random walk model due to Mandelbrot that is characterized by a tendency for the walk to persist in direction. The DNA analog of this walk is the tendency of blocks of base pairs with a given net composition to be followed by blocks of similar composition (persistence of composition). A model based on a generating function constructed from a matrix of conditional probabilities (incorporating persistence) explains the overall order in a given genome on the intermediate scale. Here we examine the block statistics using the genomes of two organisms, Bacillus anthracis and Escherichia coli, both of which have a chain length of slightly over five million base pairs. We find that the distributions in B. anthracis are well fitted by a Mandelbrot-like distribution. The distributions in E. coli, on the other hand, are not so well fitted by this distribution, which is based on two moments. Using the maximum-entropy method, we construct an improved distribution for E. coli based on four moments. Finally, we look at the order on the scale of the entire molecule (global scale). Applying the random walk model to the complete genome, we find that the Mandelbrot distribution on the intermediate scale cannot explain the global character of the walk: the walk has structure with features on the scale of the total length of the molecule (10⁵-10⁷ bp). To understand the three scales of order (local, intermediate and global), we construct a model sequence that incorporates Mandelbrot-type order on the intermediate scale in blocks of a single size. We then find that the character of the order on the local and global scales follows naturally from this single feature. Thus all three scales of order in DNA are incorporated into our model sequence.
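
The block analysis described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`block_at_counts`, `fit_power_law`, `composition_walk`), the block sizes, and the synthetic uniformly random test sequence are all assumptions made for illustration. The sketch counts A/T bases in non-overlapping blocks, fits the standard deviation of those counts to a power law in the block size, and builds the cumulative composition walk. For a purely random sequence the fitted exponent should come out close to 1/2; a sequence with persistence of composition would be expected to give a larger value.

```python
import math
import random
import statistics


def block_at_counts(seq, block_size):
    """Count A/T bases in consecutive non-overlapping blocks of block_size."""
    n_blocks = len(seq) // block_size
    return [
        sum(1 for base in seq[i * block_size:(i + 1) * block_size] if base in "AT")
        for i in range(n_blocks)
    ]


def fit_power_law(block_sizes, sigmas):
    """Least-squares fit of log(sigma) = log(c) + alpha * log(n).

    Returns the pair (c, alpha).
    """
    xs = [math.log(n) for n in block_sizes]
    ys = [math.log(s) for s in sigmas]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    return math.exp(mean_y - alpha * mean_x), alpha


def composition_walk(seq):
    """Cumulative walk that steps +1 for A or T and -1 for C or G."""
    position = 0
    walk = []
    for base in seq:
        position += 1 if base in "AT" else -1
        walk.append(position)
    return walk


# Synthetic stand-in for a genome (assumption): independent, uniform bases.
rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(200_000))

block_sizes = [16, 32, 64, 128, 256, 512]  # illustrative choice
sigmas = [statistics.pstdev(block_at_counts(seq, n)) for n in block_sizes]
c, alpha = fit_power_law(block_sizes, sigmas)
print(f"fitted exponent alpha = {alpha:.3f}")
```

To apply the same functions to a real genome, `seq` would be replaced by the sequence string read from a file, and the block sizes extended into the 10²-10⁴ bp range discussed above.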
