Bernaola-Galván P, Oliver J L, Hackenberg M, Coronado A V, Ivanov P Ch, Carpena P
Dpto. de Física Aplicada II, Universidad de Málaga, 29071 Málaga, Spain.
Eur Phys J B. 2012 Jun 1;85(6). doi: 10.1140/epjb/e2012-20969-5.
Segmentation is a standard method of data analysis for identifying change-points that divide a nonstationary time series into homogeneous segments. However, for series with long-range fractal correlations, most segmentation techniques detect spurious change-points that arise simply from the heterogeneities induced by the correlations and not from real nonstationarities. To avoid this oversegmentation, we present a segmentation algorithm which takes as its reference for homogeneity, instead of a random i.i.d. series, a correlated series modeled by a fractional noise with the same degree of correlations as the series to be segmented. We apply our algorithm to artificial series with long-range correlations and show that it systematically detects only the change-points produced by real nonstationarities and not those created by the correlations of the signal. Further, we apply the method to the sequence of the long arm of human chromosome 21, which is known to have long-range fractal correlations. We obtain only three segments, which clearly correspond to the three regions of different G+C composition revealed by a multi-scale wavelet plot. Similar results have been obtained when segmenting all human chromosome sequences, showing the existence of previously unknown huge compositional superstructures in the human genome.
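The abstract states the idea of the method but gives no procedure, so the following is only a minimal sketch of how a single candidate change-point could be tested against a correlated reference instead of an i.i.d. one. Everything specific in it is an assumption made for illustration and is not taken from the paper: the function names (`fractional_noise`, `max_t_split`, `split_p_value`), the use of a two-sample t statistic for the difference in means, the approximate Fourier-filtering generator for the fractional noise, the Monte Carlo estimate of significance, and the parameter `hurst`, which stands in for the "degree of correlations" and is assumed to have been estimated beforehand.

```python
import numpy as np

def fractional_noise(n, hurst, rng):
    """Approximate fractional Gaussian noise by Fourier filtering.

    Assumption: power spectrum ~ f^-(2H-1); this is an approximation,
    not an exact fGn generator.
    """
    freqs = np.fft.rfftfreq(n)
    amp = np.zeros_like(freqs)
    amp[1:] = freqs[1:] ** (-(2.0 * hurst - 1.0) / 2.0)  # DC term left at zero
    spectrum = np.fft.rfft(rng.standard_normal(n)) * amp
    x = np.fft.irfft(spectrum, n)
    return (x - x.mean()) / x.std()  # zero mean, unit variance

def max_t_split(x, min_len=10):
    """Return (t_max, index) of the split x[:index] | x[index:] that
    maximizes the pooled two-sample t statistic between the means."""
    x = np.asarray(x, dtype=float)
    n = x.size
    i = np.arange(min_len, n - min_len + 1)  # candidate left-segment lengths
    c1, c2 = np.cumsum(x), np.cumsum(x * x)
    n1, n2 = i, n - i
    s1, q1 = c1[i - 1], c2[i - 1]
    s2, q2 = c1[-1] - s1, c2[-1] - q1
    m1, m2 = s1 / n1, s2 / n2
    ss = (q1 - n1 * m1**2) + (q2 - n2 * m2**2)  # within-segment sum of squares
    pooled_sd = np.sqrt(ss / (n - 2))
    t = np.abs(m1 - m2) / (pooled_sd * np.sqrt(1.0 / n1 + 1.0 / n2))
    k = int(np.argmax(t))
    return float(t[k]), int(i[k])

def split_p_value(x, hurst, n_surr=200, min_len=10, seed=0):
    """Monte Carlo p-value of the best split, using correlated surrogates
    with the same (assumed known) Hurst exponent as the null reference."""
    rng = np.random.default_rng(seed)
    t_obs, idx = max_t_split(x, min_len)
    t_null = np.array([
        max_t_split(fractional_noise(len(x), hurst, rng), min_len)[0]
        for _ in range(n_surr)
    ])
    p = (1 + np.sum(t_null >= t_obs)) / (n_surr + 1)
    return t_obs, idx, float(p)
```

In a full segmentation, a split accepted at some significance level would be applied recursively to each resulting segment until no further split is significant. The point of the sketch is only that the null distribution of the statistic comes from surrogates that share the correlations of the data, so that fluctuations produced by the correlations alone are not counted as change-points.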