A vector reconstruction based clustering algorithm particularly for large-scale text collection.

School of Management and School of Computer Science and Technology, Harbin, China.

School of Management, Harbin, China.

Neural Netw. 2015 Mar;63:141-55. doi: 10.1016/j.neunet.2014.10.012. Epub 2014 Dec 9.

With the rapid development of internet technology, internet users face a large amount of textual data every day. Organizing texts into categories can help users extract useful information from a large-scale text collection. Clustering is one of the most promising tools for categorizing texts because it is unsupervised. Unfortunately, most traditional clustering algorithms lose quality on large-scale text collections, mainly because of the high-dimensional vector space and the semantic similarity among texts. To cluster large-scale text collections effectively and efficiently, this paper puts forward a vector reconstruction based clustering algorithm in which only the features that can represent a cluster are preserved in the cluster's representative vector. The algorithm alternates between two sub-processes until it converges. The first is the partial tuning sub-process, in which feature weights are fine-tuned by an iterative process similar to the self-organizing map (SOM) algorithm. To accelerate clustering, an intersection-based similarity measure and its corresponding neuron adjustment function are proposed and implemented in this sub-process. The second is the overall tuning sub-process, in which features are reallocated among clusters and the features that are useless for representing a cluster are removed from its representative vector. Experimental results on three text collections (two small-scale and one large-scale) demonstrate that the algorithm achieves high-quality performance on both small-scale and large-scale text collections.
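The abstract gives no formulas, so the following is only a rough sketch of how the two alternating sub-processes might fit together, not the authors' implementation. Everything concrete in it is an assumption made for illustration: the sparse dictionary representation, the min-based `intersection_similarity`, the update rule in `partial_tune`, the "largest weight wins" reallocation and top-`keep` pruning in `overall_tune`, seeding with the first `k` documents, and all parameter names.

```python
from typing import Dict, List, Optional, Tuple

SparseVec = Dict[str, float]  # feature -> weight (assumed representation)


def intersection_similarity(doc: SparseVec, rep: SparseVec) -> float:
    """Assumed measure: sum of the smaller weight over shared features only."""
    small, large = (doc, rep) if len(doc) <= len(rep) else (rep, doc)
    return sum(min(w, large[f]) for f, w in small.items() if f in large)


def partial_tune(doc: SparseVec, rep: SparseVec, lr: float) -> None:
    """SOM-style step: move the winner's weights toward the document.

    Assumption: only the document's own features are touched, so the cost
    depends on the document length, not on the vocabulary size.
    """
    for f, w in doc.items():
        old = rep.get(f, 0.0)
        rep[f] = old + lr * (w - old)


def overall_tune(reps: List[SparseVec], keep: int) -> None:
    """Assumed reallocation: a feature stays only in the cluster where its
    weight is largest, then each cluster keeps its `keep` heaviest features."""
    owner: Dict[str, int] = {}
    for c, rep in enumerate(reps):
        for f, w in rep.items():
            if f not in owner or w > reps[owner[f]][f]:
                owner[f] = c
    for c, rep in enumerate(reps):
        kept = sorted((f for f in rep if owner[f] == c),
                      key=rep.get, reverse=True)[:keep]
        reps[c] = {f: rep[f] for f in kept}


def cluster(docs: List[SparseVec], k: int, max_epochs: int = 20,
            lr: float = 0.5, keep: int = 50) -> Tuple[List[int], List[SparseVec]]:
    """Alternate partial and overall tuning until assignments stop changing."""
    reps = [dict(d) for d in docs[:k]]  # assumption: seed with first k docs
    prev: Optional[List[int]] = None
    labels: List[int] = []
    for _ in range(max_epochs):
        labels = []
        for d in docs:
            sims = [intersection_similarity(d, r) for r in reps]
            best = max(range(k), key=sims.__getitem__)
            labels.append(best)
            partial_tune(d, reps[best], lr)
        overall_tune(reps, keep)
        if labels == prev:  # simple convergence check (assumed)
            break
        prev = labels
    return labels, reps


docs = [
    {"cat": 1.0, "dog": 1.0},
    {"stock": 1.0, "bond": 1.0},
    {"cat": 1.0, "pet": 1.0},
    {"stock": 1.0, "market": 1.0},
]
labels, reps = cluster(docs, k=2)
print(labels)  # [0, 1, 0, 1]
```

In this toy run the two pet-related documents and the two finance-related documents end up in separate clusters, and each representative vector holds only the features of its own topic.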

Similar articles

Self-Taught convolutional neural networks for short text clustering.

Neural Netw. 2017 Apr;88:22-31. doi: 10.1016/j.neunet.2016.12.008. Epub 2017 Jan 12.

CASS: A distributed network clustering algorithm based on structure similarity for large-scale network.

PLoS One. 2018 Oct 10;13(10):e0203670. doi: 10.1371/journal.pone.0203670. eCollection 2018.

Solving text clustering problem using a memetic differential evolution algorithm.

PLoS One. 2020 Jun 11;15(6):e0232816. doi: 10.1371/journal.pone.0232816. eCollection 2020.

Clustering: a neural network approach.

Neural Netw. 2010 Jan;23(1):89-107. doi: 10.1016/j.neunet.2009.08.007. Epub 2009 Aug 29.

Discovering Thematically Coherent Biomedical Documents Using Contextualized Bidirectional Encoder Representations from Transformers-Based Clustering.

Int J Environ Res Public Health. 2022 May 12;19(10):5893. doi: 10.3390/ijerph19105893.

Interval data clustering using self-organizing maps based on adaptive Mahalanobis distances.

Neural Netw. 2013 Oct;46:124-32. doi: 10.1016/j.neunet.2013.04.009. Epub 2013 May 7.

Integrating contextual information to enhance SOM-based text document clustering.

Neural Netw. 2002 Oct-Nov;15(8-9):1099-106. doi: 10.1016/s0893-6080(02)00082-5.

An incremental clustering method based on the boundary profile.

PLoS One. 2018 Apr 20;13(4):e0196108. doi: 10.1371/journal.pone.0196108. eCollection 2018.

Text classification algorithm of tourist attractions subcategories with modified TF-IDF and Word2Vec.

PLoS One. 2024 Oct 18;19(10):e0305095. doi: 10.1371/journal.pone.0305095. eCollection 2024.
