Witten Daniela M, Tibshirani Robert
J Am Stat Assoc. 2010 Jun 1;105(490):713-726. doi: 10.1198/jasa.2010.tm09415.
We consider the problem of clustering observations using a potentially large set of features. One might expect that the true underlying clusters present in the data differ only with respect to a small fraction of the features, and will be missed if one clusters the observations using the full set of features. We propose a novel framework for sparse clustering, in which one clusters the observations using an adaptively chosen subset of the features. The method uses a lasso-type penalty to select the features. We use this framework to develop simple methods for sparse K-means and sparse hierarchical clustering. A single criterion governs both the selection of the features and the resulting clusters. These approaches are demonstrated on simulated data and on genomic data sets.
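The abstract describes selecting features and clusters with a single criterion, alternating between clustering and a lasso-type (L1) update of nonnegative feature weights. Below is a minimal sketch of sparse K-means in that spirit, not the authors' implementation (their reference implementation is the R package `sparcl`). The function names (`sparse_kmeans`, `_update_weights`, `_bcss`), the use of scikit-learn's `KMeans`, and the specific convergence tolerance are assumptions made for illustration; `s` denotes the L1 bound on the weight vector, typically between 1 and sqrt(p).

```python
# Minimal sketch of sparse K-means in the spirit of Witten & Tibshirani (2010).
# Hypothetical helpers; assumes s (the L1 bound on the weights) lies in [1, sqrt(p)].
import numpy as np
from sklearn.cluster import KMeans

def _bcss(X, labels):
    """Per-feature between-cluster sum of squares: total SS minus within-cluster SS."""
    total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    within = np.zeros(X.shape[1])
    for k in np.unique(labels):
        Xk = X[labels == k]
        within += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    return total - within

def _soft_threshold(a, delta):
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def _update_weights(a, s):
    """Maximize w'a subject to ||w||_2 <= 1, ||w||_1 <= s, w >= 0.

    Soft-threshold the per-feature criterion a, renormalize, and binary-search
    the threshold delta until the L1 constraint is met.
    """
    w = a / np.linalg.norm(a)
    if np.abs(w).sum() <= s:
        return w
    lo, hi = 0.0, np.max(np.abs(a))
    for _ in range(50):
        delta = (lo + hi) / 2.0
        w = _soft_threshold(a, delta)
        w = w / np.linalg.norm(w)
        if np.abs(w).sum() > s:
            lo = delta
        else:
            hi = delta
    return w

def sparse_kmeans(X, K, s, n_iter=20, random_state=0):
    n, p = X.shape
    w = np.full(p, 1.0 / np.sqrt(p))  # start from equal weights, ||w||_2 = 1
    for _ in range(n_iter):
        # (1) With w fixed, cluster on features scaled by sqrt(w_j); K-means on the
        #     scaled data maximizes the weighted between-cluster sum of squares.
        labels = KMeans(n_clusters=K, n_init=10, random_state=random_state).fit_predict(
            X * np.sqrt(w)
        )
        # (2) With clusters fixed, update w by soft-thresholding the per-feature BCSS,
        #     which zeroes out features that do not separate the clusters.
        w_new = _update_weights(_bcss(X, labels), s)
        if np.abs(w_new - w).sum() / np.abs(w).sum() < 1e-4:
            w = w_new
            break
        w = w_new
    return labels, w  # features with w_j == 0 are excluded from the clustering
```

In this sketch the tuning parameter `s` plays the role of the lasso-type penalty in the abstract: smaller values of `s` drive more feature weights to exactly zero, yielding clusters defined by an adaptively chosen subset of the features.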