Shen Hui, Bhamidi Shankar, Liu Yufeng
Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, U.S.A.
Department of Statistics and Operations Research, Department of Genetics, and Department of Biostatistics, Carolina Center for Genome Sciences, Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, U.S.A.
J Comput Graph Stat. 2024;33(1):219-230. doi: 10.1080/10618600.2023.2219708. Epub 2023 Jul 20.
Clustering is a fundamental tool for exploratory data analysis. One central problem in clustering is deciding if the clusters discovered by clustering methods are reliable as opposed to being artifacts of natural sampling variation. Statistical significance of clustering (SigClust) is a recently developed cluster evaluation tool for high-dimension, low-sample size data. Despite its successful application to many scientific problems, there are cases where the original SigClust may not work well. Furthermore, for specific applications, researchers may not have access to the original data and only have the dissimilarity matrix. In this case, clustering is still a valuable exploratory tool, but the original SigClust is not applicable. To address these issues, we propose a new SigClust method using multidimensional scaling (MDS). The underlying idea behind MDS-based SigClust is that one can achieve low-dimensional representations of the original data via MDS using only the dissimilarity matrix and then apply SigClust on the low-dimensional MDS space. The proposed MDS-based SigClust can circumvent the challenge of parameter estimation of the original method in high-dimensional spaces while keeping the essential clustering structure in the MDS space. Both simulations and real data applications demonstrate that the proposed method works remarkably well for assessing the statistical significance of clustering. Supplemental materials for the article are available online.
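The idea described in the abstract (embed the data with MDS using only the dissimilarity matrix, then test for clustering in the low-dimensional space) can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes classical (Torgerson) MDS, a 2-means cluster index (within-cluster sum of squares divided by total sum of squares) as the test statistic, and a single Gaussian null whose per-axis variances are taken directly from the MDS coordinates. The function names `classical_mds`, `two_means_ci`, and `mds_sigclust_pvalue`, and all parameter defaults, are hypothetical choices made for illustration.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: k-dim coordinates from an n x n dissimilarity matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centered squared dissimilarities
    w, V = np.linalg.eigh(B)                      # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]                 # indices of the k largest eigenvalues
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))

def two_means_ci(X, rng, n_init=10, n_iter=100):
    """2-means cluster index: within-cluster SS / total SS (smaller = stronger split)."""
    n = X.shape[0]
    total = ((X - X.mean(axis=0)) ** 2).sum()
    best = total                                  # index is 1 if no valid split is found
    for _ in range(n_init):
        centers = X[rng.choice(n, size=2, replace=False)]
        for _ in range(n_iter):                   # plain Lloyd iterations
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            if labels.min() == labels.max():      # one cluster is empty; give up on this start
                break
            new_centers = np.stack([X[labels == j].mean(axis=0) for j in (0, 1)])
            converged = np.allclose(new_centers, centers)
            centers = new_centers
            if converged:
                break
        if labels.min() != labels.max():
            within = sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
                         for j in (0, 1))
            best = min(best, within)
    return best / total

def mds_sigclust_pvalue(D, k=2, n_sim=200, seed=0):
    """Monte Carlo p-value for 'no cluster structure', computed in the MDS space."""
    rng = np.random.default_rng(seed)
    X = classical_mds(D, k)
    n = X.shape[0]
    ci_obs = two_means_ci(X, rng)
    # Assumed null: one Gaussian with diagonal covariance. Classical MDS axes are
    # uncorrelated, so only the per-axis standard deviations are needed.
    sd = X.std(axis=0, ddof=1)
    ci_null = np.array([two_means_ci(rng.standard_normal((n, k)) * sd, rng)
                        for _ in range(n_sim)])
    return (1 + np.sum(ci_null <= ci_obs)) / (1 + n_sim)

# Usage: two well-separated groups in 50 dimensions, passed in only as distances.
data_rng = np.random.default_rng(1)
pts = np.vstack([data_rng.standard_normal((20, 50)),
                 data_rng.standard_normal((20, 50)) + 10.0])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
p_value = mds_sigclust_pvalue(D, k=2, n_sim=100)
```

Only the dissimilarity matrix `D` enters `mds_sigclust_pvalue`, which mirrors the setting where the original data are unavailable. A small p-value suggests that the observed two-cluster split is stronger than expected under the single-Gaussian null. The choice of `k` and the form of the null distribution are assumptions of this sketch and would need care in practice.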