Stability estimation for unsupervised clustering: A review.

Authors

Liu Tianmou, Yu Han, Blair Rachael Hageman

Affiliations

Institute for Artificial Intelligence and Data Science, State University of New York at Buffalo, Buffalo, New York, USA.

Roswell Park Comprehensive Cancer Center, Buffalo, New York, USA.

Publication information

Wiley Interdiscip Rev Comput Stat. 2022 Nov-Dec;14(6):e1575. doi: 10.1002/wics.1575. Epub 2022 Jan 9.

DOI: 10.1002/wics.1575

PMID: 36583207

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9787023/

Abstract

Cluster analysis remains one of the most challenging yet fundamental tasks in unsupervised learning. This is due in part to the fact that there are no labels or gold standards by which performance can be measured. Moreover, the wide range of clustering methods available is governed by different objective functions, different parameters, and different dissimilarity measures. Clustering serves many purposes, often playing critical roles in the early stages of exploratory data analysis and as an endpoint for knowledge discovery. Thus, understanding the quality of a clustering is of critical importance. The concept of stability has emerged as a strategy for assessing the performance and reproducibility of data clustering. The key idea is to produce perturbed data sets that are very close to the original, and cluster them. If the clustering is stable, then the clusters from the original data will be preserved in the perturbed data clustering. The nature of the perturbation, and the methods for quantifying similarity between clusterings, are nontrivial, and are ultimately what set many of the stability estimation methods apart. In this review, we provide an overview of the very active research area of cluster stability estimation and discuss some of the open questions and challenges that remain in the field. This article is categorized under: Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification.
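
The perturb-cluster-compare idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption chosen for illustration and is not taken from the review: the perturbation is random subsampling without replacement, the clustering method is a plain k-means, the similarity measure is the (unadjusted) Rand index, and the names `kmeans`, `rand_index`, `stability`, and `two_blobs` are hypothetical helpers.

```python
import random


def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd's algorithm on 2-D points; returns one label per point."""
    centers = rng.sample(points, k)  # assumption: random data points as initial centers
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                        + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return labels


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (same vs. different cluster)."""
    agree = total = 0
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            agree += (a[i] == a[j]) == (b[i] == b[j])
            total += 1
    return agree / total


def stability(points, k, n_perturb=20, frac=0.8, seed=0):
    """Mean Rand index between the original clustering and clusterings of subsamples."""
    rng = random.Random(seed)
    reference = kmeans(points, k, rng)  # clustering of the original data
    n_sub = int(frac * len(points))
    scores = []
    for _ in range(n_perturb):
        idx = rng.sample(range(len(points)), n_sub)       # perturbation: subsample
        perturbed = kmeans([points[i] for i in idx], k, rng)
        # Compare only on the points present in the perturbed data set.
        scores.append(rand_index([reference[i] for i in idx], perturbed))
    return sum(scores) / len(scores)


def two_blobs(n_per_blob=30, seed=1):
    """Toy data (illustrative only): two well-separated Gaussian blobs."""
    rng = random.Random(seed)
    blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(n_per_blob)]
    blob_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(n_per_blob)]
    return blob_a + blob_b


data = two_blobs()
print("stability for k=2:", stability(data, k=2))  # expected to be close to 1
print("stability for k=5:", stability(data, k=5))  # over-splitting is typically less stable
```

Other choices of perturbation (for example bootstrap resampling or added noise) and of similarity measure (for example the Jaccard index or the adjusted Rand index) would slot into the same loop; which combination is appropriate is exactly the kind of question the abstract describes as nontrivial.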

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4043/9787023/5322abeff07a/WICS-14-e1575-g003.jpg
