Zielinski Jerzy S, Bouaynaya Nidhal, Schonfeld Dan, O'Neill William
Department of Systems Engineering, University of Arkansas at Little Rock, Little Rock, AR, USA.
BMC Bioinformatics. 2008 Aug 12;9 Suppl 9(Suppl 9):S14. doi: 10.1186/1471-2105-9-S9-S14.
Over the past decade, many investigators have used sophisticated time series tools for the analysis of genomic sequences. Specifically, the correlation of the nucleotide chain has been studied by examining the properties of the power spectrum. The main limitation of the power spectrum is that it is restricted to stationary time series. However, genomic sequences have been observed to exhibit non-stationary statistical behavior, and standard statistical tests have verified that they are indeed not stationary. More recent analysis of genomic data has relied on time-varying power spectral methods to capture the statistical characteristics of genomic sequences. Techniques such as the evolutionary spectrum and evolutionary periodogram have been successful in extracting the time-varying correlation structure. The main difficulty in using time-varying spectral methods is that they are extremely unstable: large deviations in the correlation structure result from very minor perturbations in the genomic data and experimental procedure. A fundamentally new approach is needed to provide a stable platform for the non-stationary statistical analysis of genomic sequences.
In this paper, we propose to model non-stationary genomic sequences by a time-dependent autoregressive moving average (TD-ARMA) process. The model is based on a classical ARMA process whose coefficients are allowed to vary with time. A series expansion of the time-varying coefficients is used to form a generalized Yule-Walker-type system of equations. A recursive least-squares algorithm is subsequently used to estimate the time-dependent coefficients of the model. The estimated non-stationary parameters are used as a basis for statistical inference and biophysical interpretation of genomic data. In particular, we rely on the TD-ARMA model of genomic sequences to investigate the statistical properties of coding and non-coding regions in the nucleotide chain and to differentiate between them. Specifically, we define a quantitative measure of randomness to assess how far a process deviates from white noise. Our simulation results on various gene sequences show that both the coding and non-coding regions are non-random. However, coding sequences are "whiter" than non-coding sequences, as attested by a higher index of randomness.
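The estimation idea described above (expand each time-varying coefficient in a series, then fit the expansion weights by recursive least squares) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: it covers only the autoregressive part of the model (no moving-average terms), it uses a polynomial basis f_k(n) = (n/N)^k, it writes the model as x[n] = sum_i a_i(n) x[n-i] + e[n], and it applies recursive least squares directly to that regression instead of going through the generalized Yule-Walker-type system. The names `basis`, `rls_tdar`, and `coefficients` are hypothetical.

```python
import numpy as np


def basis(n, N, m):
    """Assumed polynomial basis f_k(n) = (n/N)**k for k = 0..m."""
    return (n / N) ** np.arange(m + 1)


def rls_tdar(x, p=2, m=1, lam=1.0, delta=1e3):
    """Fit the weights c[i, k] of a time-dependent AR(p) model

        x[n] = sum_i a_i(n) * x[n-i] + e[n],   a_i(n) = sum_k c[i, k] * f_k(n)

    by recursive least squares (forgetting factor lam, initial P = delta*I).
    Row i of the returned (p, m+1) array belongs to lag i+1.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    d = p * (m + 1)
    theta = np.zeros(d)          # stacked weights, lag-major order
    P = delta * np.eye(d)        # inverse correlation matrix estimate
    for n in range(p, N):
        past = x[n - p:n][::-1]  # x[n-1], ..., x[n-p]
        # Regressor: every lagged sample times every basis function.
        phi = np.outer(past, basis(n, N, m)).ravel()
        P_phi = P @ phi
        gain = P_phi / (lam + phi @ P_phi)
        err = x[n] - phi @ theta             # a priori prediction error
        theta = theta + gain * err
        P = (P - np.outer(gain, P_phi)) / lam
    return theta.reshape(p, m + 1)


def coefficients(c, N):
    """Evaluate a_i(n) for n = 0..N-1; returns an array of shape (N, p)."""
    m = c.shape[1] - 1
    F = (np.arange(N)[:, None] / N) ** np.arange(m + 1)  # (N, m+1)
    return F @ c.T
```

To apply such a sketch to a nucleotide sequence, the symbols would first have to be mapped to numbers; the mapping is a modelling choice that this abstract does not specify. On a synthetic signal, calling `rls_tdar(x, p=1, m=1)` and then `coefficients(c, len(x))` returns the fitted trajectory of the single AR coefficient over time.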
We demonstrate that the proposed TD-ARMA model can be used to provide a stable time series tool for the analysis of non-stationary genomic sequences. The estimated time-varying coefficients are used to define an index of randomness that assesses the statistical correlations in coding and non-coding DNA sequences. It turns out that the statistical differences between coding and non-coding sequences are more subtle than stationary analysis tools had previously suggested: both coding and non-coding sequences exhibit statistical correlations, with the coding regions being "whiter" than the non-coding regions. These results corroborate the evolutionary periodogram analysis of genomic sequences and refute the conclusion of stationary analysis that coding DNA behaves like a random sequence.