Zhong Shi
Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL 33431, USA.
Neural Netw. 2005 Jun-Jul;18(5-6):790-8. doi: 10.1016/j.neunet.2005.06.008.
Clustering data streams has been a new research topic, recently emerged from many real data mining applications, and has attracted a lot of research attention. However, there is little work on clustering high-dimensional streaming text data. This paper combines an efficient online spherical k-means (OSKM) algorithm with an existing scalable clustering strategy to achieve fast and adaptive clustering of text streams. The OSKM algorithm modifies the spherical k-means (SPKM) algorithm, using online update (for cluster centroids) based on the well-known Winner-Take-All competitive learning. It has been shown to be as efficient as SPKM, but much superior in clustering quality. The scalable clustering strategy was previously developed to deal with very large databases that cannot fit into a limited memory and that are too expensive to read/scan multiple times. Using the strategy, one keeps only sufficient statistics for history data to retain (part of) the contribution of history data and to accommodate the limited memory. To make the proposed clustering algorithm adaptive to data streams, we introduce a forgetting factor that applies exponential decay to the importance of history data. The older a set of text documents, the less weight they carry. Our experimental results demonstrate the efficiency of the proposed algorithm and reveal an intuitive and an interesting fact for clustering text streams-one needs to forget to be adaptive.
数据流聚类是一个新的研究课题,它最近出现在许多实际数据挖掘应用中,并引起了大量的研究关注。然而,针对高维流文本数据的聚类工作却很少。本文将一种高效的在线球形k均值(OSKM)算法与现有的可扩展聚类策略相结合,以实现文本流的快速自适应聚类。OSKM算法对球形k均值(SPKM)算法进行了改进,基于著名的胜者全得竞争学习使用在线更新(用于聚类中心)。结果表明,它与SPKM一样高效,但在聚类质量上要优越得多。可扩展聚类策略以前是为处理无法装入有限内存且读取/扫描多次成本过高的超大型数据库而开发的。使用该策略,只需保留历史数据的足够统计信息,以保留(部分)历史数据的贡献并适应有限的内存。为了使所提出的聚类算法能够适应数据流,我们引入了一个遗忘因子,该因子对历史数据的重要性应用指数衰减。一组文本文档越旧,其权重就越小。我们的实验结果证明了所提算法的有效性,并揭示了一个关于文本流聚类的直观且有趣的事实——为了实现自适应需要遗忘。