Prodip Hore, Lawrence O. Hall, Dmitry B. Goldgof
Department of Computer Science and Engineering, ENB118, University of South Florida, Tampa, Florida 33620.
Pattern Recognit. 2009 May;42(5):676-688. doi: 10.1016/j.patcog.2008.09.027.
An ensemble of clustering solutions or partitions may be generated for a number of reasons. If the data set is very large, clustering may be done on disjoint subsets of tractable size. The data may be distributed across different sites, for which a distributed clustering solution with a final merging of partitions is a natural fit. In this paper, two new approaches to combining partitions, represented by sets of cluster centers, are introduced. The advantage of these approaches is that they provide a final partition of the data that is comparable to those of the best existing approaches, yet scale to extremely large data sets. They can be 100,000 times faster while using much less memory. The new algorithms are compared against the best existing cluster-ensemble merging approaches, against clustering all the data at once, and against a clustering algorithm designed for very large data sets. The comparison is done for both fuzzy and hard k-means based clustering algorithms. It is shown that the centroid-based ensemble merging algorithms presented here generate partitions of quality comparable to the best label-vector approach or to clustering all the data at once, while providing very large speedups.
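The core idea of centroid-based merging can be illustrated with a minimal sketch: run hard k-means independently on disjoint chunks of the data, pool the resulting cluster centers, and then cluster those centers to obtain the final k centroids. This is a simplified illustration under assumed details (1-D data, plain unweighted k-means, illustrative function names), not the authors' exact algorithms.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain hard k-means on 1-D points; returns k cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize from the data points
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: abs(p - centers[i]))
            groups[j].append(p)
        # Update step: recompute means (keep old center if a group empties).
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers


def merge_by_centroids(data, k, n_chunks=4):
    """Cluster disjoint chunks separately, then cluster the pooled
    chunk centroids to get the final k centers. Only the n_chunks * k
    centroids are held for merging, not the full data set."""
    chunk = len(data) // n_chunks
    all_centers = []
    for c in range(n_chunks):
        subset = data[c * chunk:(c + 1) * chunk]
        all_centers.extend(kmeans(subset, k))
    # Merging step: k-means over the small set of pooled centroids.
    return sorted(kmeans(all_centers, k))
```

Because the merge operates only on the pooled centroids, its cost is independent of the number of data points, which is the source of the scalability the abstract describes.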