Keefe Thomas H, Marron J S
Department of Statistics & O.R., UNC-Chapel Hill.
J Comput Graph Stat. 2025 Apr 16. doi: 10.1080/10618600.2025.2469756.
Clustering methods are popular for revealing structure in data, particularly in the high-dimensional setting common to contemporary data science. A central question is "are the clusters really there?" One pioneering method in statistical cluster validation is SigClust, but it is severely underpowered in the important setting where the candidate clusters have unbalanced sizes, such as in rare subtypes of disease. We show why this is the case and propose a remedy that is powerful in both the unbalanced and balanced settings, using a novel generalization of k-means clustering. We illustrate the value of our method using a high-dimensional dataset of gene expression in kidney cancer patients. A Python implementation is available at https://github.com/thomaskeefe/sigclust.