IEEE Trans Cybern. 2022 Jun;52(6):4173-4186. doi: 10.1109/TCYB.2020.3023973. Epub 2022 Jun 16.
In an era of ubiquitous large-scale evolving data streams, data stream clustering (DSC) has received considerable attention because the scale of the data streams far exceeds what expert human analysts can process. High-dimensional data have been observed to usually lie in a union of low-dimensional subspaces. In this article, we propose a novel sparse representation-based DSC algorithm, called evolutionary dynamic sparse subspace clustering (EDSSC). It can cope with the time-varying nature of the subspaces underlying evolving data streams, such as subspace emergence, disappearance, and recurrence. The proposed EDSSC consists of two phases: 1) static learning and 2) online clustering. In the first phase, a data structure for storing the statistical summary of data streams, called the EDSSC summary, is proposed; it better addresses the dilemma between two conflicting goals: 1) saving more points for the accuracy of subspace clustering (SC) and 2) discarding more points for the efficiency of DSC. With a further proposed algorithm that estimates the number of subspaces, EDSSC does not need the number of subspaces to be known in advance. In the second phase, a more suitable index, called the average sparsity concentration index (ASCI), is proposed, which dramatically improves clustering accuracy compared with the conventionally used sparsity concentration index (SCI). In addition, a subspace evolution detection model based on the Page-Hinkley test is proposed, with which appearing, disappearing, and recurring subspaces can be detected and adapted to. Extensive experiments on real-world data streams show that EDSSC outperforms state-of-the-art online SC approaches.
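For readers unfamiliar with the Page-Hinkley test mentioned above, the following is a minimal sketch of the generic test for detecting an upward shift in the mean of a scalar stream. It is not the EDSSC detection model itself: the abstract does not specify which statistic is monitored, so the input here is simply assumed to be some per-point score (for example, a representation residual), and the class name `PageHinkley`, the helper `first_alarm`, and the default values of `delta` and `threshold` are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageHinkley:
    """Generic one-sided Page-Hinkley test for an increase in the mean."""

    delta: float = 0.005    # tolerated magnitude of change (assumed value)
    threshold: float = 5.0  # alarm threshold, often written lambda (assumed value)
    n: int = 0              # number of observations seen so far
    mean: float = 0.0       # running mean of the observations
    cum: float = 0.0        # cumulative deviation m_t
    cum_min: float = 0.0    # running minimum M_t of the cumulative deviation

    def update(self, x: float) -> bool:
        """Consume one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        # Accumulate how far x lies above the running mean, minus the tolerance.
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        # Alarm when the cumulative deviation rises far above its minimum.
        return self.cum - self.cum_min > self.threshold


def first_alarm(stream, **kwargs):
    """Return the index of the first alarm in `stream`, or None if there is none."""
    detector = PageHinkley(**kwargs)
    for t, x in enumerate(stream):
        if detector.update(x):
            return t
    return None


if __name__ == "__main__":
    # Synthetic scores: a stable regime followed by an abrupt upward shift.
    scores = [0.1] * 50 + [2.0] * 50
    print("first alarm at index:", first_alarm(scores))
```

On the synthetic stream in the usage example, the alarm fires a few points after the shift at index 50, while a stream that stays constant raises no alarm. In a streaming setting, a detector of this kind would typically be reset after each alarm so that subsequent changes can be detected as well.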