Selective inference for k-means clustering.

Author information

Chen Yiqun T, Witten Daniela M

Affiliations

Data Science Institute and Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA.

Departments of Statistics and Biostatistics, University of Washington, Seattle, WA 98195-4322, USA.

Publication information

J Mach Learn Res. 2023 May;24.

PMID:38264325

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10805457/

Abstract

We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. In recent work, Gao et al. (2022) considered a related problem in the context of hierarchical clustering. Unfortunately, their solution is highly tailored to the context of hierarchical clustering, and thus cannot be applied in the setting of k-means clustering. In this paper, we propose a p-value that conditions on all of the intermediate clustering assignments in the k-means algorithm. We show that the p-value controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering in finite samples, and can be efficiently computed. We apply our proposal to hand-written digits data and to single-cell RNA-sequencing data.
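The abstract's claim that classical tests have an inflated Type I error rate after clustering can be illustrated with a small simulation. The sketch below is not the authors' method or code: it does not compute the proposed selective p-value, and every function name, the extreme-point initialization, the known variance of 1, and the sample sizes are assumptions made for illustration. It draws data with no true clusters, splits them with a hand-written two-cluster Lloyd's algorithm, and then applies a naive two-sample z-test that ignores how the groups were formed.

```python
import math
import random


def two_means_1d(x, n_iter=100):
    """Lloyd's algorithm with k = 2 on 1-D data; returns a 0/1 label per point."""
    # Initializing the centers at the sample extremes is an illustrative assumption.
    centers = [min(x), max(x)]
    labels = None
    for _ in range(n_iter):
        new_labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1 for v in x]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        for c in (0, 1):
            members = [v for v, lab in zip(x, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


def naive_z_test_p_value(x, labels, sigma=1.0):
    """Two-sided z-test for equal cluster means, ignoring that labels came from x."""
    g0 = [v for v, lab in zip(x, labels) if lab == 0]
    g1 = [v for v, lab in zip(x, labels) if lab == 1]
    if not g0 or not g1:
        return 1.0  # only one non-empty cluster: nothing to compare
    diff = sum(g1) / len(g1) - sum(g0) / len(g0)
    se = sigma * math.sqrt(1.0 / len(g0) + 1.0 / len(g1))
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(abs(diff) / se / math.sqrt(2.0))


def naive_rejection_rate(n=30, reps=500, alpha=0.05, seed=0):
    """Fraction of null data sets on which the naive test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Global null: every observation is N(0, 1), so there are no true clusters.
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        p = naive_z_test_p_value(x, two_means_1d(x))
        rejections += p < alpha
    return rejections / reps


if __name__ == "__main__":
    print(naive_rejection_rate())
```

A valid test would reject in roughly a fraction `alpha` of these null data sets. Because the clustering step deliberately separates the observations, the naive test should reject far more often than that, which is the kind of failure a selective p-value is designed to avoid.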

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d08a/10805457/eae94b28613e/nihms-1916887-f0002.jpg

Similar articles

Selective inference for k-means clustering.

J Mach Learn Res. 2023 May;24.

Testing for a difference in means of a single feature after clustering.

ArXiv. 2023 Nov 27:arXiv:2311.16375v1.

Selective Inference for Hierarchical Clustering.

J Am Stat Assoc. 2024;119(545):332-342. doi: 10.1080/01621459.2022.2116331. Epub 2022 Oct 11.

Does Determination of Initial Cluster Centroids Improve the Performance of K-Means Clustering Algorithm? Comparison of Three Hybrid Methods by Genetic Algorithm, Minimum Spanning Tree, and Hierarchical Clustering in an Applied Study.

Comput Math Methods Med. 2020 Aug 1;2020:7636857. doi: 10.1155/2020/7636857. eCollection 2020.

Merging K-means with hierarchical clustering for identifying general-shaped groups.

Stat (Int Stat Inst). 2018;7(1). doi: 10.1002/sta4.172. Epub 2018 Jan 17.

Penalized unsupervised learning with outliers.

Stat Interface. 2013;6(2):211-221. doi: 10.4310/sii.2013.v6.n2.a5.

Sheep's coping style can be identified by unsupervised machine learning from unlabeled data.

Behav Processes. 2022 Jan;194:104559. doi: 10.1016/j.beproc.2021.104559. Epub 2021 Nov 25.

Subspace K-means clustering.

Behav Res Methods. 2013 Dec;45(4):1011-23. doi: 10.3758/s13428-013-0329-y.

Self-Adaptive Multiprototype-Based Competitive Learning Approach: A k-Means-Type Algorithm for Imbalanced Data Clustering.

IEEE Trans Cybern. 2021 Mar;51(3):1598-1612. doi: 10.1109/TCYB.2019.2916196. Epub 2021 Feb 17.

A Cheap Feature Selection Approach for the K-Means Algorithm.

IEEE Trans Neural Netw Learn Syst. 2021 May;32(5):2195-2208. doi: 10.1109/TNNLS.2020.3002576. Epub 2021 May 3.

Cited by

Powerful significance testing for unbalanced clusters.

J Comput Graph Stat. 2025 Apr 16. doi: 10.1080/10618600.2025.2469756.

Federated k-means based on clusters backbone.

PLoS One. 2025 Jun 12;20(6):e0326145. doi: 10.1371/journal.pone.0326145. eCollection 2025.

Comment on "Data Fission: Splitting a Single Data Point", Data Fission for Unsupervised Learning: A Discussion on Post-Clustering Inference and the Challenges of Debiasing.

J Am Stat Assoc. 2025;120(549):174-175. doi: 10.1080/01621459.2024.2412191. Epub 2025 Apr 14.

Spatially Resolved Multiomics: Data Analysis from Monoomics to Multiomics.

BME Front. 2024 Jan 13;6:0084. doi: 10.34133/bmef.0084. eCollection 2025.

Testing for a difference in means of a single feature after clustering.

Biostatistics. 2024 Dec 31;26(1). doi: 10.1093/biostatistics/kxae046.

Neuroimaging-based variability in subtyping biomarkers for psychiatric heterogeneity.

Mol Psychiatry. 2025 May;30(5):1966-1975. doi: 10.1038/s41380-024-02807-y. Epub 2024 Nov 7.

Weighted families of contact maps to characterize conformational ensembles of (highly-)flexible proteins.

Bioinformatics. 2024 Nov 1;40(11). doi: 10.1093/bioinformatics/btae627.

Testing for a difference in means of a single feature after clustering.

ArXiv. 2023 Nov 27:arXiv:2311.16375v1.

References

Selective Inference for Hierarchical Clustering.

J Am Stat Assoc. 2024;119(545):332-342. doi: 10.1080/01621459.2022.2116331. Epub 2022 Oct 11.

More Powerful Selective Inference for the Graph Fused Lasso.

J Comput Graph Stat. 2023;32(2):577-587. doi: 10.1080/10618600.2022.2097246. Epub 2022 Sep 6.

Testing for a Change in Mean After Changepoint Detection.

J R Stat Soc Series B Stat Methodol. 2022 Sep;84(4):1082-1104. doi: 10.1111/rssb.12501. Epub 2022 Apr 12.

Selection-Corrected Statistical Inference for Region Detection With High-Throughput Assays.

J Am Stat Assoc. 2019;114(527):1351-1365. doi: 10.1080/01621459.2018.1498347. Epub 2018 Nov 13.

Quantifying uncertainty in spikes estimated from calcium imaging data.

Biostatistics. 2023 Apr 14;24(2):481-501. doi: 10.1093/biostatistics/kxab034.

Exponential-Family Embedding With Application to Cell Developmental Trajectories for Single-Cell RNA-Seq Data.

J Am Stat Assoc. 2021;116(534):457-470. doi: 10.1080/01621459.2021.1886106. Epub 2021 Feb 8.

Data-Driven Strategies for Accelerated Materials Design.

Acc Chem Res. 2021 Feb 16;54(4):849-860. doi: 10.1021/acs.accounts.0c00785. Epub 2021 Feb 2.

Post-selection inference for changepoint detection algorithms with application to copy number variation data.

Biometrics. 2021 Sep;77(3):1037-1049. doi: 10.1111/biom.13422. Epub 2021 Jan 27.

Statistical significance of cluster membership for unsupervised evaluation of cell identities.

Bioinformatics. 2020 May 1;36(10):3107-3114. doi: 10.1093/bioinformatics/btaa087.

Eleven grand challenges in single-cell data science.

Genome Biol. 2020 Feb 7;21(1):31. doi: 10.1186/s13059-020-1926-6.
