Statistical properties of DNA sequences.

Author information

Peng C K, Buldyrev S V, Goldberger A L, Havlin S, Mantegna R N, Simons M, Stanley H E

Affiliation

Cardiovascular Division, Harvard Medical School, Boston, MA 02215, USA.

Publication information

Physica A. 1995;221:180-92. doi: 10.1016/0378-4371(95)00247-5.

DOI:10.1016/0378-4371(95)00247-5

PMID:11540495

Abstract

We review evidence supporting the idea that the DNA sequence in genes containing non-coding regions is correlated, and that the correlation is remarkably long range; indeed, nucleotides thousands of base pairs distant are correlated. We do not find such a long-range correlation in the coding regions of the gene. We resolve the problem of the "non-stationarity" feature of the sequence of base pairs by applying a new algorithm called detrended fluctuation analysis (DFA). We address the claim of Voss that there is no difference in the statistical properties of coding and non-coding regions of DNA by systematically applying the DFA algorithm, as well as standard FFT analysis, to every DNA sequence (33301 coding and 29453 non-coding) in the entire GenBank database. Finally, we describe briefly some recent work showing that the non-coding sequences have certain statistical features in common with natural and artificial languages. Specifically, we adapt to DNA the Zipf approach to analyzing linguistic texts. These statistical properties of non-coding sequences support the possibility that non-coding regions of DNA may carry biological information.
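
The abstract names detrended fluctuation analysis (DFA) without describing its steps. The following is a minimal sketch of one common formulation, not the authors' implementation. It assumes a purine/pyrimidine "DNA walk" mapping (C, T to +1; A, G to -1), linear detrending in non-overlapping boxes, and a log-log least-squares fit for the scaling exponent. The function names and the box sizes used in the example are illustrative choices, not taken from the paper.

```python
import math
import random


def dna_walk(seq):
    """Map nucleotides to steps. Assumed rule: pyrimidine (C, T) -> +1,
    purine (A, G) -> -1. Any other symbol is skipped."""
    step = {"C": 1, "T": 1, "A": -1, "G": -1}
    return [step[base] for base in seq.upper() if base in step]


def residual_sum_of_squares(ys):
    """Squared residuals of a least-squares straight line fitted to ys."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * x)) ** 2 for x, y in enumerate(ys))


def dfa_fluctuation(steps, box_size):
    """Root-mean-square fluctuation F(n) of the detrended, integrated walk."""
    if box_size < 4:
        raise ValueError("box_size should be at least 4 for a meaningful fit")
    mean_step = sum(steps) / len(steps)
    profile, total = [], 0.0
    for s in steps:  # integrate the mean-subtracted steps
        total += s - mean_step
        profile.append(total)
    n_boxes = len(profile) // box_size  # trailing partial box is dropped
    ss = sum(
        residual_sum_of_squares(profile[b * box_size:(b + 1) * box_size])
        for b in range(n_boxes)
    )
    return math.sqrt(ss / (n_boxes * box_size))


def dfa_exponent(steps, box_sizes):
    """Slope of log F(n) versus log n, i.e. the scaling exponent alpha."""
    xs = [math.log(n) for n in box_sizes]
    ys = [math.log(dfa_fluctuation(steps, n)) for n in box_sizes]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic, uncorrelated sequence used only as a sanity check.
    seq = "".join(rng.choice("ACGT") for _ in range(20000))
    print(dfa_exponent(dna_walk(seq), [8, 16, 32, 64, 128]))
```

For an uncorrelated sequence such as the synthetic one above, the exponent is expected to lie near 0.5; values clearly above 0.5 over a wide range of box sizes would indicate long-range correlation of the kind the abstract reports for non-coding regions.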
