Chen Yiqun T, Witten Daniela M
Data Science Institute and Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA.
Departments of Statistics and Biostatistics, University of Washington, Seattle, WA 98195-4322, USA.
J Mach Learn Res. 2023 May;24.
We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. In recent work, Gao et al. (2022) considered a related problem in the context of hierarchical clustering. Unfortunately, their solution is highly tailored to hierarchical clustering, and thus cannot be applied in the setting of k-means clustering. In this paper, we propose a p-value that conditions on all of the intermediate clustering assignments in the k-means algorithm. We show that the p-value controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering in finite samples, and can be efficiently computed. We apply our proposal to hand-written digits data and to single-cell RNA-sequencing data.
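The inflated Type I error mentioned above can be illustrated with a small simulation. The sketch below is not the selective p-value proposed in the paper; it only demonstrates the problem that motivates it. It makes several simplifying assumptions that are not taken from the paper: one-dimensional data, k = 2, a known variance of 1, a deterministic initialisation at the sample extremes, and a naive two-sided z-test. All function names are hypothetical.

```python
import math
import random


def two_means_1d(xs, n_iter=100):
    """Lloyd's algorithm for k = 2 on one-dimensional data (illustrative only)."""
    c0, c1 = min(xs), max(xs)  # assumption: initialise centroids at the extremes
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to the nearer centroid.
        new_labels = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: recompute each centroid as the mean of its cluster.
        g0 = [x for x, lab in zip(xs, labels) if lab == 0]
        g1 = [x for x, lab in zip(xs, labels) if lab == 1]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return labels


def naive_z_test_pvalue(xs, labels, sigma=1.0):
    """Two-sided z-test that ignores the fact that the labels came from clustering."""
    g0 = [x for x, lab in zip(xs, labels) if lab == 0]
    g1 = [x for x, lab in zip(xs, labels) if lab == 1]
    diff = sum(g0) / len(g0) - sum(g1) / len(g1)
    se = sigma * math.sqrt(1.0 / len(g0) + 1.0 / len(g1))
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(diff / se) / math.sqrt(2.0))


def naive_rejection_rate(n_sims=500, n=50, alpha=0.05, seed=0):
    """Fraction of null data sets on which the naive test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Global null: every observation comes from the same N(0, 1) distribution,
        # so there is no true difference in means between any two groups.
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        labels = two_means_1d(xs)
        if naive_z_test_pvalue(xs, labels) < alpha:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    print(f"Naive rejection rate under the null: {naive_rejection_rate():.3f}")
```

Because the clusters are chosen to separate the observations, the naive test rejects far more often than the nominal 5% level even though no true difference in means exists. A valid selective p-value, such as the one the paper proposes, would instead account for the clustering step.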