Douglas Poland
Department of Chemistry, The Johns Hopkins University, Baltimore, MD 21218, USA.
Biophys Chem. 2003 Dec 1;106(3):275-303. doi: 10.1016/s0301-4622(03)00213-8.
In this paper we explore the free energy distribution in the helical form of DNA, using the genome of the bacterium Rickettsia prowazekii Madrid E as an example. The genome of this organism was determined by Andersson et al. (Nature 396 (1998) 133) and is available on the World Wide Web (www.tigr.org). Using the helix statistical weights of SantaLucia (Proc. Natl. Acad. Sci. USA 95 (1998) 1460), which are based on nearest-neighbor base pairs, we calculate the free energy in consecutive blocks of m base pairs in the DNA sequence and then construct the distribution of these free energy values. Using the maximum-entropy method, we fit the distribution curves with a function based on the moments of the distribution. For blocks containing 10-20 base pairs the distribution is slightly skewed, and four moments are required for an accurate fit. For blocks containing 100 base pairs or more, the distribution is well approximated by a Gaussian function based on the first two moments. We find that the free energy distribution for m=20 can be reproduced using random sequences that have the local (singlet, doublet or triplet) statistics of Rickettsia. For much larger blocks, however, for example m=500, the free energy distribution of the actual Rickettsia genome is broader by almost a factor of 3 than the distributions based on random local statistics. The distribution functions for the C or G content in blocks of m base pairs behave almost the same way as a function of block size as the free energy distributions do. To reproduce the width of the distribution functions based on the actual Rickettsia sequence, we need to introduce tables (matrices) that correlate the states of consecutive blocks hundreds of base pairs long. This indicates that correlations extending over roughly the number of base pairs in an average gene are required to give the actual widths of both the C or G content distributions and the helix free energy distributions. Above a certain value of m, the distributions for larger m can be accurately expressed in terms of the distribution functions for smaller m. Thus, for example, the distribution for m=5000 can be expressed in terms of the generating function for m=1000.
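The block calculation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the author's code: the nearest-neighbor table holds the unified ΔG°37 values (kcal/mol) as recalled from SantaLucia (1998) and should be checked against the original table; each block energy is taken as the sum of the m-1 stacking terms inside the block, with initiation and terminal corrections ignored; and the names `NN_DG`, `block_free_energies`, `moments` and `shuffled` are hypothetical. The demo uses a toy random sequence, not the Rickettsia genome.

```python
import random
from statistics import mean, pvariance

# Unified nearest-neighbor dG(37 C) values in kcal/mol, keyed by the 5'->3'
# dinucleotide on one strand. Assumption: recalled from SantaLucia (1998);
# verify against the published table before any real use.
NN_DG = {
    "AA": -1.00, "TT": -1.00,
    "AT": -0.88,
    "TA": -0.58,
    "CA": -1.45, "TG": -1.45,
    "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28,
    "GA": -1.30, "TC": -1.30,
    "CG": -2.17,
    "GC": -2.24,
    "GG": -1.84, "CC": -1.84,
}


def block_free_energies(seq, m):
    """Free energy of each consecutive, non-overlapping block of m base pairs.

    Assumes seq contains only A, C, G, T. Each block contributes the m-1
    stacking terms between its own neighboring base pairs.
    """
    seq = seq.upper()
    energies = []
    for start in range(0, len(seq) - m + 1, m):
        block = seq[start:start + m]
        energies.append(sum(NN_DG[block[i:i + 2]] for i in range(m - 1)))
    return energies


def moments(values):
    """Return (mean, variance, skewness) of a list of block values."""
    mu = mean(values)
    var = pvariance(values, mu)
    sd = var ** 0.5
    if sd == 0:
        return mu, var, 0.0
    skew = sum((v - mu) ** 3 for v in values) / (len(values) * sd ** 3)
    return mu, var, skew


def shuffled(seq, rng):
    """Random control that keeps only the singlet (base composition) statistics."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy AT-rich sequence; the composition weights are arbitrary placeholders.
    toy = "".join(rng.choices("ACGT", weights=[35, 15, 15, 35], k=20000))
    for m in (20, 500):
        real = moments(block_free_energies(toy, m))
        ctrl = moments(block_free_energies(shuffled(toy, rng), m))
        print(m, real, ctrl)
```

For a real genome, comparing the variance from the actual sequence with that from the shuffled control at several block sizes is one way to look for the broadening reported above for large m. For the toy sequence here no such broadening is expected, since it has no long-range correlations by construction.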