Dai Yifan, Wu Di, Liu Yufeng
Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
Department of Biomedical Sciences, Adams School of Dentistry, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
Biometrics. 2025 Jul 3;81(3). doi: 10.1093/biomtc/ujaf120.
Clustering is widely used in biomedical research for meaningful subgroup identification. However, most existing clustering algorithms do not account for the statistical uncertainty of the resulting clusters and consequently may generate spurious clusters due to natural sampling variation. To address this problem, the Statistical Significance of Clustering (SigClust) method was developed to evaluate the significance of clusters in high-dimensional data. While SigClust has been successful in assessing clustering significance for continuous data, it is not specifically designed for discrete data, such as count data in genomics. Moreover, SigClust and its variations can suffer from reduced statistical power when applied to non-Gaussian high-dimensional data. To overcome these limitations, we propose SigClust-DEV, a method designed to evaluate the significance of clusters in count data. Through extensive simulations, we compare SigClust-DEV against other existing SigClust approaches across various count distributions and demonstrate its superior performance. Furthermore, we apply our proposed SigClust-DEV to Hydra single-cell RNA sequencing (scRNA) data and electronic health records (EHRs) of cancer patients to identify meaningful latent cell types and patient subgroups, respectively.
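To make the idea of testing cluster significance concrete, the following is a minimal sketch of a SigClust-style Monte Carlo test. It is not SigClust-DEV, whose details are not given in this abstract. It assumes the original Gaussian-null formulation in simplified form: the test statistic is the 2-means cluster index (within-cluster sum of squares divided by total sum of squares), and the null distribution is simulated from a single Gaussian whose variances are the raw eigenvalues of the sample covariance. The function names, the plain Lloyd-style 2-means routine, the absence of any eigenvalue adjustment, and the "+1" Monte Carlo p-value convention are all illustrative assumptions.

```python
import numpy as np


def two_means_cluster_index(X, rng, n_init=5, n_iter=50):
    """2-means cluster index: within-cluster SS / total SS (smaller = stronger split)."""
    n = X.shape[0]
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    best_within = np.inf
    for _ in range(n_init):
        # Start from two distinct observations chosen at random.
        centers = X[rng.choice(n, size=2, replace=False)]
        for _ in range(n_iter):
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            if labels.min() == labels.max():  # one cluster is empty; stop this run
                break
            new_centers = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best_within = min(best_within, dist.min(axis=1).sum())
    return best_within / total_ss


def sigclust_sketch(X, n_sim=100, seed=0):
    """Simplified Gaussian-null significance test for a 2-cluster split.

    Assumes X has at least 2 columns. Returns (observed cluster index, p-value).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    observed = two_means_cluster_index(X, rng)
    # The cluster index is rotation invariant, so a diagonal null covariance
    # built from the sample eigenvalues is sufficient for this sketch.
    eigvals = np.clip(np.linalg.eigvalsh(np.cov(X, rowvar=False)), 0.0, None)
    null = np.empty(n_sim)
    for b in range(n_sim):
        Z = rng.standard_normal((n, d)) * np.sqrt(eigvals)
        null[b] = two_means_cluster_index(Z, rng)
    # Monte Carlo p-value: share of null indices at least as small as observed.
    p_value = (1 + (null <= observed).sum()) / (n_sim + 1)
    return observed, p_value


# Toy count data (made up for illustration): two Poisson groups with different means.
data_rng = np.random.default_rng(1)
counts = np.vstack([data_rng.poisson(2, size=(40, 5)),
                    data_rng.poisson(20, size=(40, 5))])
ci, p = sigclust_sketch(counts)
print(f"cluster index = {ci:.3f}, p-value = {p:.4f}")
```

A small p-value indicates that the observed split is tighter than would be expected from a single Gaussian population. Because the null here is Gaussian, the sketch also illustrates the limitation the abstract raises: count data need not resemble that null, which is the gap a count-specific method such as SigClust-DEV is meant to address.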