A vector reconstruction based clustering algorithm particularly for large-scale text collection.

Affiliations

School of Management and School of Computer Science and Technology, Harbin, China.

School of Management, Harbin, China.

Publication information

Neural Netw. 2015 Mar;63:141-55. doi: 10.1016/j.neunet.2014.10.012. Epub 2014 Dec 9.

Abstract

With the rapid evolution of internet technology, internet users face a large amount of textual data every day. Organizing texts into categories can help users extract useful information from large-scale text collections. Clustering is one of the most promising tools for categorizing texts because it is unsupervised. Unfortunately, most traditional clustering algorithms lose quality on large-scale text collections, mainly because of the high-dimensional vector space and the semantic similarity among texts. To cluster large-scale text collections effectively and efficiently, this paper puts forward a vector reconstruction based clustering algorithm. Only the features that can represent a cluster are preserved in the cluster's representative vector. The algorithm alternates between two sub-processes until it converges. The first is the partial tuning sub-process, in which feature weights are fine-tuned by an iterative process similar to the self-organizing map (SOM) algorithm. To accelerate clustering, an intersection-based similarity measure and its corresponding neuron adjustment function are proposed and implemented in this sub-process. The second is the overall tuning sub-process, in which features are reallocated among clusters and the features that are useless for representing a cluster are removed from its representative vector. Experimental results on three text collections (two small-scale and one large-scale) show that the algorithm achieves high-quality performance on both small-scale and large-scale text collections.
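The abstract does not give the formulas, so the following is only a rough sketch of the ideas it names, not the paper's method. It assumes sparse vectors stored as feature-to-weight dictionaries, a dot product restricted to shared features as the intersection-based similarity, a standard SOM-style update rule applied only to shared features, and a weight threshold for dropping features. The names `intersection_similarity`, `partial_tune`, `overall_tune`, `assign`, `rate`, and `min_weight` are all hypothetical, and the feature reallocation between clusters is not modeled.

```python
from typing import Dict, List

# Sparse vector: feature name -> weight (an assumed representation).
SparseVec = Dict[str, float]


def intersection_similarity(doc: SparseVec, rep: SparseVec) -> float:
    """Sum of weight products over features present in both vectors.

    Assumed form of an intersection-based similarity; features that
    appear in only one vector are never visited.
    """
    # Iterate over the shorter vector so the cost depends on its length.
    small, large = (doc, rep) if len(doc) <= len(rep) else (rep, doc)
    return sum(w * large[f] for f, w in small.items() if f in large)


def assign(doc: SparseVec, reps: List[SparseVec]) -> int:
    """Index of the representative vector most similar to the document."""
    return max(range(len(reps)),
               key=lambda i: intersection_similarity(doc, reps[i]))


def partial_tune(doc: SparseVec, rep: SparseVec, rate: float) -> None:
    """SOM-like update (assumed rule): move shared weights toward the document."""
    for f, w in doc.items():
        if f in rep:
            rep[f] += rate * (w - rep[f])


def overall_tune(rep: SparseVec, min_weight: float) -> None:
    """Drop features below a hypothetical weight threshold.

    Covers only the removal of weak features, not reallocation.
    """
    for f in [f for f, w in rep.items() if w < min_weight]:
        del rep[f]


if __name__ == "__main__":
    reps = [{"cat": 1.0, "dog": 0.5, "car": 0.05},
            {"car": 1.0, "road": 0.8}]
    doc = {"cat": 0.8, "dog": 0.6}
    winner = assign(doc, reps)
    partial_tune(doc, reps[winner], rate=0.5)
    overall_tune(reps[winner], min_weight=0.1)
    print(winner, reps[winner])
```

Restricting both the similarity and the update to shared features keeps the per-document cost tied to the number of non-zero features rather than to the full vocabulary size, which is one plausible reading of why the abstract says this accelerates clustering.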
