Long-range correlation properties of coding and noncoding DNA sequences: GenBank analysis.

Authors

Buldyrev S V, Goldberger A L, Havlin S, Mantegna R N, Matsa M E, Peng C K, Simons M, Stanley H E

Affiliation

Department of Physics, Boston University, Massachusetts 02215, USA.

Publication information

Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics. 1995 May;51(5):5084-91. doi: 10.1103/physreve.51.5084.

Abstract

An open question in computational molecular biology is whether long-range correlations are present in both coding and noncoding DNA or only in the latter. To answer this question, we consider all 33301 coding and all 29453 noncoding eukaryotic sequences, each longer than 512 base pairs (bp), in the present release of GenBank to determine whether there is any statistically significant distinction in their long-range correlation properties. Standard fast Fourier transform (FFT) analysis indicates that coding sequences have practically no correlations in the range from 10 bp to 100 bp (spectral exponent beta = 0.00 +/- 0.04, where the uncertainty is two standard deviations). In contrast, for noncoding sequences, the average value of the spectral exponent beta is positive (0.16 +/- 0.05), which unambiguously shows the presence of long-range correlations. We also separately analyze the 874 coding and the 1157 noncoding sequences that have more than 4096 bp and find a larger region of power-law behavior. We calculate the probability that these two data sets (coding and noncoding) were drawn from the same distribution and find that it is less than 10^(-10). We obtain independent confirmation of these findings using the method of detrended fluctuation analysis (DFA), which is designed to treat sequences with statistical heterogeneity, such as DNA's known mosaic structure ("patchiness") arising from the nonstationarity of nucleotide concentration. The near-perfect agreement between the two independent analysis methods, FFT and DFA, increases the confidence in the reliability of our conclusion.
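The two exponents mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code, and several choices are assumptions made only for illustration: the purine/pyrimidine mapping in `to_numeric`, the 512 bp averaging window, the DFA box sizes, and all function names. For a power-law correlated sequence the two exponents are expected to be related by alpha = (1 + beta)/2, so an uncorrelated sequence should give beta near 0 and alpha near 0.5.

```python
import numpy as np

def to_numeric(seq):
    """Map purines (A, G) to +1 and everything else to -1.

    Assumed mapping rule; the abstract does not specify one.
    Ambiguous bases such as N also fall into the -1 class here.
    """
    return np.array([1.0 if b in "AG" else -1.0 for b in seq.upper()])

def spectral_exponent(u, min_period=10, max_period=100, window=512):
    """Estimate beta in S(f) ~ 1/f^beta from a window-averaged power spectrum."""
    n_win = len(u) // window
    segs = u[: n_win * window].reshape(n_win, window)
    segs = segs - segs.mean(axis=1, keepdims=True)   # remove each window's mean
    power = (np.abs(np.fft.rfft(segs, axis=1)) ** 2).mean(axis=0)
    freqs = np.fft.rfftfreq(window, d=1.0)           # cycles per base pair
    # Keep only frequencies whose periods lie between min_period and max_period.
    mask = (freqs >= 1.0 / max_period) & (freqs <= 1.0 / min_period)
    slope, _ = np.polyfit(np.log10(freqs[mask]), np.log10(power[mask]), 1)
    return -slope

def dfa_exponent(u, box_sizes=(8, 16, 32, 64, 128)):
    """Estimate the DFA exponent alpha in F(n) ~ n^alpha with linear detrending."""
    y = np.cumsum(u - u.mean())                      # integrated profile ("walk")
    fluct = []
    for n in box_sizes:
        n_box = len(y) // n
        boxes = y[: n_box * n].reshape(n_box, n).T   # shape (n, n_box)
        x = np.arange(n)
        coeffs = np.polyfit(x, boxes, 1)             # one linear fit per box
        trend = np.outer(x, coeffs[0]) + coeffs[1]
        fluct.append(np.sqrt(np.mean((boxes - trend) ** 2)))
    alpha, _ = np.polyfit(np.log10(box_sizes), np.log10(fluct), 1)
    return alpha

# Demo on a synthetic, uncorrelated random sequence (not GenBank data).
rng = np.random.default_rng(0)
seq = "".join(rng.choice(list("ACGT"), size=2**16))
u = to_numeric(seq)
print("beta  =", spectral_exponent(u))
print("alpha =", dfa_exponent(u))
```

Averaging the spectrum over fixed-length windows reduces the scatter of the log-log fit. Detrending each box separately is what lets DFA tolerate slow drifts in nucleotide composition, which a plain fluctuation analysis could mistake for long-range correlation.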
