Google Research, New York, New York, USA.
Two Sigma Investments, New York, New York, USA.
Biometrics. 2023 Jun;79(2):940-950. doi: 10.1111/biom.13665. Epub 2022 Apr 22.
High-dimensional clustering analysis is a challenging problem in statistics and machine learning, with broad applications such as the analysis of microarray and RNA-seq data. In this paper, we propose a new clustering procedure called spectral clustering with feature selection (SC-FS). We first obtain an initial estimate of the labels via spectral clustering, then select the small fraction of features with the largest R-squared with respect to these labels (that is, the largest proportion of variation explained by the group labels), and finally cluster again using only the selected features. Under mild conditions, we prove that the proposed method identifies all informative features with high probability and achieves the minimax optimal clustering error rate for the sparse Gaussian mixture model. Applications of SC-FS to four real-world datasets demonstrate its usefulness in clustering high-dimensional data.
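The three steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes only two clusters, uses the sign of the leading left singular vector of the centered data as a simple stand-in for the spectral clustering step, and takes the number of retained features (`n_keep`) as given. The function names, the simulated data, and all parameter values are hypothetical.

```python
import numpy as np


def spectral_labels(X):
    """Two-group spectral split (assumption: K = 2).

    Labels each sample by the sign of its entry in the leading
    left singular vector of the column-centered data matrix.
    """
    Xc = X - X.mean(axis=0)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return (U[:, 0] > 0).astype(int)


def r_squared(X, labels):
    """Per-feature proportion of variation explained by the group labels."""
    total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    within = np.zeros(X.shape[1])
    for g in np.unique(labels):
        Xg = X[labels == g]
        within += ((Xg - Xg.mean(axis=0)) ** 2).sum(axis=0)
    # Guard against division by zero for constant features.
    return 1.0 - within / np.maximum(total, 1e-12)


def sc_fs(X, n_keep):
    """Sketch of SC-FS: initial clustering, feature selection, re-clustering."""
    init = spectral_labels(X)                       # step 1: initial labels
    scores = r_squared(X, init)                     # step 2: score every feature
    keep = np.sort(np.argsort(scores)[-n_keep:])    # keep the top-scoring features
    return spectral_labels(X[:, keep]), keep        # step 3: cluster again


# Hypothetical sparse two-component Gaussian mixture:
# only the first s of p features carry a mean shift between groups.
rng = np.random.default_rng(0)
n, p, s = 200, 500, 10
truth = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[:, :s] += np.where(truth[:, None] == 1, 1.5, -1.5)

labels, keep = sc_fs(X, n_keep=s)
agree = np.mean(labels == truth)
error = min(agree, 1 - agree)  # cluster labels are only defined up to swapping
print("selected features:", keep)
print("clustering error:", error)
```

In this simulated setting the informative features are strongly separated, so the selected indices should largely coincide with the first `s` columns. For more than two clusters, the sign split would need to be replaced by a k-means step on the leading singular vectors.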