Parallel spectral clustering in distributed systems.

Affiliation

Yahoo! Inc., Sunnyvale, CA 94089, USA.

Publication information

IEEE Trans Pattern Anal Mach Intell. 2011 Mar;33(3):568-86. doi: 10.1109/TPAMI.2010.88.

DOI: 10.1109/TPAMI.2010.88

PMID: 20421667

Abstract

Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms, such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of a data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix. We compare one approach by sparsifying the matrix with another by the Nyström method. We then pick the strategy of sparsifying the matrix via retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through an empirical study on a document data set of 193,844 instances and a photo data set of 2,121,863, we show that our parallel algorithm can effectively handle large problems.
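
The sparsification strategy named in the abstract (retaining only nearest neighbors in the similarity matrix) can be illustrated with a minimal single-machine sketch. This is not the authors' parallel implementation: the function name `knn_spectral_clustering`, the Gaussian similarity with a median-distance bandwidth, the max-based symmetrization, and the row-normalized embedding followed by k-means are all assumptions chosen for illustration, and the sketch uses SciPy's ARPACK wrapper on one machine rather than a distributed eigensolver.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import eigsh
from scipy.spatial import cKDTree


def knn_spectral_clustering(X, n_clusters, n_neighbors=10, seed=0):
    """Cluster the rows of X using a nearest-neighbor-sparsified similarity matrix."""
    n = X.shape[0]

    # Keep only the n_neighbors nearest neighbors of each point.
    # The first column of the query result is the point itself, so drop it.
    dist, idx = cKDTree(X).query(X, k=n_neighbors + 1)
    dist, idx = dist[:, 1:], idx[:, 1:]

    # Gaussian similarity; the bandwidth heuristic is an assumption.
    sigma = np.median(dist)
    weights = np.exp(-dist**2 / (2.0 * sigma**2))

    # Assemble the sparse similarity matrix and symmetrize it.
    rows = np.repeat(np.arange(n), n_neighbors)
    S = csr_matrix((weights.ravel(), (rows, idx.ravel())), shape=(n, n))
    S = S.maximum(S.T)

    # Normalized similarity D^{-1/2} S D^{-1/2}; its largest eigenvectors
    # are the smallest eigenvectors of the normalized Laplacian.
    degree = np.asarray(S.sum(axis=1)).ravel()
    d_inv_sqrt = diags(1.0 / np.sqrt(degree))
    M = d_inv_sqrt @ S @ d_inv_sqrt

    # Sparse symmetric eigensolver (ARPACK) for the top n_clusters eigenvectors.
    _, vecs = eigsh(M, k=n_clusters, which="LA")

    # Normalize each row to unit length, then run k-means on the embedding.
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    _, labels = kmeans2(vecs, n_clusters, minit="++", seed=seed)
    return labels
```

Because each row keeps only `n_neighbors` entries, the similarity matrix needs memory proportional to the number of instances times `n_neighbors` rather than to the square of the number of instances, which is the property that makes the sparse approach a candidate for the large data sets mentioned in the abstract.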

Similar articles

Automated variable weighting in k-means type clustering.

IEEE Trans Pattern Anal Mach Intell. 2005 May;27(5):657-68. doi: 10.1109/TPAMI.2005.95.

Scalable model-based clustering for large databases based on data summarization.

IEEE Trans Pattern Anal Mach Intell. 2005 Nov;27(11):1710-9. doi: 10.1109/TPAMI.2005.226.

A genetic algorithm using hyper-quadtrees for low-dimensional K-means clustering.

IEEE Trans Pattern Anal Mach Intell. 2006 Apr;28(4):533-43. doi: 10.1109/TPAMI.2006.66.

Evaluation of stability of k-means cluster ensembles with respect to random initialization.

IEEE Trans Pattern Anal Mach Intell. 2006 Nov;28(11):1798-808. doi: 10.1109/TPAMI.2006.226.

Weighted graph cuts without eigenvectors: a multilevel approach.

IEEE Trans Pattern Anal Mach Intell. 2007 Nov;29(11):1944-57. doi: 10.1109/TPAMI.2007.1115.

Combining multiple clusterings using evidence accumulation.

IEEE Trans Pattern Anal Mach Intell. 2005 Jun;27(6):835-50. doi: 10.1109/TPAMI.2005.113.

A novel kernel method for clustering.

IEEE Trans Pattern Anal Mach Intell. 2005 May;27(5):801-5. doi: 10.1109/TPAMI.2005.88.

Density-weighted Nyström method for computing large kernel eigensystems.

Neural Comput. 2009 Jan;21(1):121-46. doi: 10.1162/neco.2008.11-07-651.

Online clustering algorithms for radar emitter classification.

IEEE Trans Pattern Anal Mach Intell. 2005 Aug;27(8):1185-96. doi: 10.1109/TPAMI.2005.166.

Cited by

Fast sparse representative tree splitting via local density for large-scale clustering.

Sci Rep. 2025 Aug 11;15(1):29398. doi: 10.1038/s41598-025-13848-w.

Design of intelligent financial data management system based on higher-order hybrid clustering algorithm.

PeerJ Comput Sci. 2024 Jan 24;10:e1799. doi: 10.7717/peerj-cs.1799. eCollection 2024.

A Clustering Method of Case-Involved News by Combining Topic Network and Multi-Head Attention Mechanism.

Sensors (Basel). 2021 Nov 11;21(22):7501. doi: 10.3390/s21227501.

Approximate spectral clustering using both reference vectors and topology of the network generated by growing neural gas.

PeerJ Comput Sci. 2021 Aug 20;7:e679. doi: 10.7717/peerj-cs.679. eCollection 2021.

Evaluation of clustering and topic modeling methods over health-related tweets and emails.

Artif Intell Med. 2021 Jul;117:102096. doi: 10.1016/j.artmed.2021.102096. Epub 2021 May 7.

DeepAISE - An interpretable and recurrent neural survival model for early prediction of sepsis.

Artif Intell Med. 2021 Mar;113:102036. doi: 10.1016/j.artmed.2021.102036. Epub 2021 Feb 13.

Social big data: Recent achievements and new challenges.

Inf Fusion. 2016 Mar;28:45-59. doi: 10.1016/j.inffus.2015.08.005. Epub 2015 Aug 28.

ClusterEnG: an interactive educational web resource for clustering and visualizing high-dimensional data.

PeerJ Comput Sci. 2018;4. doi: 10.7717/peerj-cs.155. Epub 2018 May 21.

Spectral clustering using Nyström approximation for the accurate identification of cancer molecular subtypes.

Sci Rep. 2017 Jul 7;7(1):4896. doi: 10.1038/s41598-017-05275-3.

Clustering Acoustic Segments Using Multi-Stage Agglomerative Hierarchical Clustering.

PLoS One. 2015 Oct 30;10(10):e0141756. doi: 10.1371/journal.pone.0141756. eCollection 2015.
