Helgeson Erika S, Vock David M, Bair Eric
Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota.
Department of Endodontics and Biostatistics, University of North Carolina, Chapel Hill, North Carolina.
Biometrics. 2021 Dec;77(4):1215-1226. doi: 10.1111/biom.13376. Epub 2020 Oct 6.
Cluster analysis is an unsupervised learning strategy that is exceptionally useful for identifying homogeneous subgroups of observations in data sets of unknown structure. However, it is challenging to determine whether the identified clusters represent truly distinct subgroups rather than noise. Existing approaches for addressing this problem tend to define clusters based on distributional assumptions, ignore the inherent correlation structure in the data, or are not suited for high-dimension, low-sample size (HDLSS) settings. In this paper, we propose a novel method to evaluate the significance of identified clusters by comparing the variation explained by clustering the original data to that explained by clustering a unimodal reference distribution that preserves the covariance structure in the data. The reference distribution is generated using kernel density estimation and thus does not require that the data follow a particular distribution. By utilizing sparse covariance estimation, the method is adapted for the HDLSS setting. The approach can be used to test the null hypothesis that the data cannot be partitioned into clusters and to determine the optimal number of clusters. Simulation examples, theoretical evaluations, and applications to temporomandibular disorder research and cancer microarray data illustrate the utility of the proposed method.
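The testing idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and simplifies heavily. The unimodal reference is a multivariate Gaussian with the sample mean and covariance, swapped in for the kernel density estimate. The number of clusters is fixed at two, the clustering is a plain k-means, and sparse covariance estimation for the HDLSS setting is omitted. The function names `two_means`, `explained_variation`, and `cluster_p_value` are hypothetical, and the data are assumed to have at least two features.

```python
import numpy as np

def two_means(X, rng, n_init=5, n_iter=50):
    """Plain Lloyd's algorithm with k=2; returns the smallest
    within-cluster sum of squares found over n_init random starts."""
    best = np.inf
    for _ in range(n_init):
        # Start from two distinct observations chosen at random.
        centers = X[rng.choice(len(X), size=2, replace=False)]
        for _ in range(n_iter):
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            # Keep the old center if a cluster happens to become empty.
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(2)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, dist.min(axis=1).sum())
    return best

def explained_variation(X, rng):
    """Proportion of total variation explained by a two-cluster partition."""
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    return 1.0 - two_means(X, rng) / total_ss

def cluster_p_value(X, n_ref=200, seed=0):
    """Monte Carlo p-value for 'no clusters', comparing the observed
    explained variation to that of a unimodal reference distribution.

    ASSUMPTION: the reference is a Gaussian with the sample mean and
    covariance, a stand-in for the kernel-density-based reference
    described in the abstract.
    """
    rng = np.random.default_rng(seed)
    observed = explained_variation(X, rng)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)  # preserves the covariance structure
    reference = [
        explained_variation(rng.multivariate_normal(mean, cov, size=len(X)), rng)
        for _ in range(n_ref)
    ]
    # Add-one correction keeps the estimated p-value strictly positive.
    exceed = sum(r >= observed for r in reference)
    return observed, (1 + exceed) / (n_ref + 1)
```

Calling `cluster_p_value(X)` on an `n x p` array returns the observed explained variation and the Monte Carlo p-value; a small p-value suggests that the two-cluster partition explains more variation than would be expected from unimodal data with the same covariance.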