Havlin S, Buldyrev S V, Goldberger A L, Mantegna R N, Peng C K, Simons M, Stanley H E
Center for Polymer Studies and Department of Physics, Boston University, MA 02215, USA.
Fractals. 1995 Jun;3(2):269-84. doi: 10.1142/s0218348x95000229.
We present evidence supporting the idea that the DNA sequence in genes containing noncoding regions is correlated, and that the correlation is remarkably long range: base pairs separated by thousands of base pairs are correlated. We do not find such long-range correlations in the coding regions of the gene. We resolve the problem of the "non-stationary" character of the base-pair sequence by applying a new algorithm called Detrended Fluctuation Analysis (DFA). We address the claim of Voss that there is no difference in the statistical properties of coding and noncoding regions of DNA by systematically applying the DFA algorithm, as well as standard FFT analysis, to all eukaryotic DNA sequences (33 301 coding and 29 453 noncoding) in the entire GenBank database. We describe a simple model, based on a generalization of the classic Lévy walk, that accounts for the presence of long-range power-law correlations. Finally, we briefly describe some recent work showing that the noncoding sequences have certain statistical features in common with natural languages. Specifically, we adapt to DNA the Zipf approach to analyzing linguistic texts and the Shannon approach to quantifying the "redundancy" of a linguistic text in terms of a measurable entropy function. We suggest that noncoding regions in plants and invertebrates may display a smaller entropy and larger redundancy than coding regions, further supporting the possibility that noncoding regions of DNA may carry biological information.
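The abstract names Detrended Fluctuation Analysis but does not spell out its steps. The following is a minimal sketch of one common formulation, not the authors' implementation. It assumes a purine/pyrimidine mapping of bases to a walk (+1 for C or T, -1 for A or G), non-overlapping boxes, and a linear least-squares trend in each box. The function names `dna_walk`, `dfa_fluctuation`, and `dfa_exponent` are hypothetical and chosen for illustration only.

```python
import numpy as np


def dna_walk(seq):
    """Map a DNA string to a cumulative walk.

    Assumed rule (not stated in the abstract): step +1 for a pyrimidine
    (C, T) and -1 for a purine (A, G). Other symbols are skipped.
    """
    steps = np.array(
        [1.0 if base in "CT" else -1.0 for base in seq.upper() if base in "ACGT"]
    )
    return np.cumsum(steps)


def dfa_fluctuation(walk, box):
    """Root-mean-square fluctuation of the walk around per-box linear trends."""
    n_boxes = len(walk) // box
    if n_boxes < 1:
        raise ValueError("box size exceeds walk length")
    # Cut the walk into non-overlapping boxes; one column per box.
    segments = walk[: n_boxes * box].reshape(n_boxes, box).T
    x = np.arange(box)
    # Fit a straight line to every box at once (coeffs has shape (2, n_boxes)).
    coeffs = np.polyfit(x, segments, 1)
    trend = np.outer(x, coeffs[0]) + coeffs[1]
    residuals = segments - trend
    return np.sqrt(np.mean(residuals ** 2))


def dfa_exponent(walk, boxes):
    """Slope of log F(l) against log l over the given box sizes."""
    fluct = np.array([dfa_fluctuation(walk, b) for b in boxes])
    slope, _intercept = np.polyfit(np.log(boxes), np.log(fluct), 1)
    return slope
```

Under these assumptions, an uncorrelated sequence should give an exponent close to 0.5, while long-range power-law correlations would show up as an exponent that stays above 0.5 across a wide range of box sizes. Fitting the trend inside each box is what removes the local drift responsible for the "non-stationary" character of the walk.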