Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA.
Department of Oncologic Sciences, University of South Florida, Tampa, FL, USA.
BMC Bioinformatics. 2023 Mar 31;24(1):125. doi: 10.1186/s12859-023-05210-6.
Cluster analysis is used frequently in scientific theory and applications to separate data into groups. A key assumption in many clustering algorithms is that the data were generated from a population consisting of multiple distinct clusters. Clusterability testing allows users to question this assumption of latent cluster structure, which is a theoretical requirement for meaningful results in cluster analysis.
This paper proposes methods for clusterability testing designed for high-dimensional data by utilizing sparse principal component analysis. Type I error and power of the clusterability tests are evaluated using simulated data with different types of cluster structure in high dimensions. Empirical performance of the new methods is evaluated and compared with prior methods on gene expression, microarray, and shotgun proteomics data. Our methods had reasonably low Type I error and maintained power for many datasets with a variety of structures and dimensions. Cluster structure was not detectable in other datasets with spatially close clusters.
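The general workflow described above (reduce high-dimensional data with a sparse principal component, then test the reduced data for cluster structure) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the sparse component is found by a simple truncated power iteration, the test statistic is a one-dimensional two-means variance ratio compared against a Gaussian Monte Carlo reference, and the names `sparse_first_pc`, `two_means_ratio`, and `clusterability_pvalue` are hypothetical. The Gaussian reference ignores the fact that the projection direction was chosen from the data, so the p-value is not calibrated and only shows the shape of the procedure.

```python
import numpy as np


def sparse_first_pc(X, n_nonzero=10, n_iter=100):
    """Sparse leading loading vector via truncated power iteration.

    Illustrative stand-in for sparse PCA: after each power step only the
    n_nonzero largest-magnitude loadings are kept.
    """
    Xc = X - X.mean(axis=0)
    # Start from the ordinary (dense) leading right singular vector.
    v = np.linalg.svd(Xc, full_matrices=False)[2][0]
    for _ in range(n_iter):
        w = Xc.T @ (Xc @ v)                     # power step on X^T X
        keep = np.argsort(np.abs(w))[-n_nonzero:]
        v_new = np.zeros_like(w)
        v_new[keep] = w[keep]                   # truncate to a sparse vector
        v_new /= np.linalg.norm(v_new)
        if np.allclose(v_new, v):
            break
        v = v_new
    return v


def two_means_ratio(scores):
    """Best two-group within-group sum of squares divided by the total.

    Values near 0 suggest two well-separated groups along this axis.
    """
    s = np.sort(scores - np.mean(scores))
    n = s.size
    c1, c2 = np.cumsum(s), np.cumsum(s ** 2)
    k = np.arange(1, n)                         # size of the left group
    left = c2[k - 1] - c1[k - 1] ** 2 / k
    right = (c2[-1] - c2[k - 1]) - (c1[-1] - c1[k - 1]) ** 2 / (n - k)
    total = c2[-1] - c1[-1] ** 2 / n
    return (left + right).min() / total


def clusterability_pvalue(X, n_nonzero=10, n_null=199, seed=0):
    """Monte Carlo p-value for the two-means ratio of sparse PC scores."""
    v = sparse_first_pc(X, n_nonzero=n_nonzero)
    scores = (X - X.mean(axis=0)) @ v
    observed = two_means_ratio(scores)
    rng = np.random.default_rng(seed)
    # Reference distribution: the same statistic on Gaussian samples of size n.
    null = np.array([two_means_ratio(rng.normal(size=scores.size))
                     for _ in range(n_null)])
    return (1 + np.sum(null <= observed)) / (n_null + 1)


# Simulated example: 60 samples, 200 features, two groups that differ
# in the first 10 features only.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))
X[:30, :10] += 2.0
X[30:, :10] -= 2.0
print(clusterability_pvalue(X))
```

A small p-value here indicates that the sparse component scores split into two groups more cleanly than Gaussian samples of the same size typically do. An established multimodality test, such as a dip test, could be substituted for the two-means ratio in the last step.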
This is the first analysis of clusterability testing on both simulated and real-world high-dimensional data.