SAIL: Summation-bAsed Incremental Learning for Information-Theoretic Text Clustering.

Publication information

IEEE Trans Cybern. 2013 Apr;43(2):570-84. doi: 10.1109/TSMCB.2012.2212430. Epub 2013 Mar 7.

Abstract

Information-theoretic clustering uses information-theoretic measures as the clustering criteria. A common approach is the so-called Info-Kmeans, which performs K-means clustering with KL-divergence as the proximity function. While prior work on Info-Kmeans has shown promising results, a remaining challenge is handling high-dimensional sparse data such as text corpora. For high-dimensional text vectors, the centroids may contain many zero-valued features, which leads to infinite KL-divergence values and creates a dilemma when assigning objects to centroids during the iterations of Info-Kmeans. To meet this challenge, in this paper, we propose a Summation-bAsed Incremental Learning (SAIL) algorithm for Info-Kmeans clustering. Specifically, by using an equivalent objective function, SAIL replaces the computation of KL-divergence with the incremental computation of Shannon entropy. This avoids the zero-feature dilemma caused by the use of KL-divergence. To improve the clustering quality, we further introduce the variable neighborhood search scheme and propose the V-SAIL algorithm, which is then accelerated by a multithreaded scheme in PV-SAIL. Our experimental results on various real-world text collections show that, with SAIL as a booster, the clustering performance of Info-Kmeans can be significantly improved. In addition, V-SAIL and PV-SAIL help improve the clustering quality at a lower computational cost.
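The zero-feature dilemma and the entropy-based objective described in the abstract can be illustrated with a small sketch. This is not the authors' SAIL implementation: the toy term counts, the uniform document weights, and all function names are assumptions made for illustration, and the objective is recomputed from scratch rather than updated incrementally. It relies on the standard identity that, for a centroid m equal to the weighted mean of its members, the weighted sum of D(x || m) over the cluster equals W * H(m) minus the weighted sum of H(x), where W is the total cluster weight.

```python
import math

def normalize(counts):
    """Turn raw term counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """D(p || q) in nats; infinite when q is zero where p is not."""
    d = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            d += pi * math.log(pi / qi)
    return d

def entropy(p):
    """Shannon entropy in nats, with 0 * log 0 treated as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def centroid(docs):
    """Mean of the member distributions (uniform weights assumed)."""
    n = len(docs)
    return [sum(d[i] for d in docs) / n for i in range(len(docs[0]))]

def kl_objective(clusters, w):
    """Weighted sum of D(x || centroid) over all documents."""
    return sum(w * kl_divergence(d, centroid(docs))
               for docs in clusters for d in docs)

def entropy_objective(clusters, w):
    """Same objective written with entropies only, so no division by zero."""
    total = 0.0
    for docs in clusters:
        if not docs:
            continue
        total += w * len(docs) * entropy(centroid(docs))
        total -= sum(w * entropy(d) for d in docs)
    return total

# Hypothetical toy corpus: four documents over a four-term vocabulary.
cluster_a = [normalize([2, 1, 0, 0]), normalize([1, 3, 0, 0])]
cluster_b = [normalize([0, 0, 4, 1]), normalize([0, 1, 1, 2])]
w = 0.25  # uniform weight per document (an assumption)

# The dilemma: a document in cluster_b uses terms that centroid_a lacks,
# so its KL-divergence to centroid_a is infinite.
centroid_a = centroid(cluster_a)
cross_kl = kl_divergence(cluster_b[1], centroid_a)

# Within the current partition both forms of the objective agree.
kl_value = kl_objective([cluster_a, cluster_b], w)
entropy_value = entropy_objective([cluster_a, cluster_b], w)

# A candidate move can still be scored with the entropy form.
moved = entropy_objective([cluster_a + [cluster_b[1]], [cluster_b[0]]], w)

print("KL to foreign centroid:", cross_kl)
print("KL objective:", kl_value, "entropy objective:", entropy_value)
print("objective after moving one document:", moved)
```

Because the entropy form only needs the centroid that would result after a move, a candidate reassignment gets a finite score even when the point-to-centroid KL-divergence is infinite.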

