Guo Jian, Levina Elizaveta, Michailidis George, Zhu Ji
Department of Statistics, University of Michigan, Ann Arbor, Michigan 48109, USA.
Biometrics. 2010 Sep;66(3):793-804. doi: 10.1111/j.1541-0420.2009.01341.x.
Variable selection for clustering is an important and challenging problem in high-dimensional data analysis. Existing variable selection methods for model-based clustering select informative variables in a "one-in-all-out" manner; that is, a variable is selected if at least one pair of clusters is separable by this variable and removed if it cannot separate any of the clusters. In many applications, however, it is of interest to further establish exactly which clusters are separable by each informative variable. To address this question, we propose a pairwise variable selection method for high-dimensional model-based clustering. The method is based on a new pairwise penalty. Results on simulated and real data show that the new method performs better than alternative approaches that use ℓ1 and ℓ∞ penalties and offers better interpretation.
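To make the contrast between "one-in-all-out" and pairwise selection concrete, the following is a minimal Python sketch. The abstract does not state the form of the pairwise penalty, so the sketch assumes it is the sum of absolute differences between cluster means, taken over every pair of clusters and every variable. It also assumes the data are centered, so that a zero cluster mean marks an uninformative variable. The function names and the toy matrix `mu` are hypothetical and chosen for illustration only.

```python
from itertools import combinations

# mu[k][j] is the mean of variable j in cluster k (data assumed centered).

def l1_penalty(mu):
    # Sum of absolute cluster means over all clusters and variables.
    return sum(abs(m) for row in mu for m in row)

def linf_penalty(mu):
    # For each variable, the largest absolute cluster mean; summed over variables.
    n_vars = len(mu[0])
    return sum(max(abs(row[j]) for row in mu) for j in range(n_vars))

def pairwise_penalty(mu):
    # ASSUMED form: absolute difference of means for every cluster pair and variable.
    return sum(abs(a[j] - b[j])
               for a, b in combinations(mu, 2)
               for j in range(len(a)))

def separable_pairs(mu, tol=1e-8):
    # For each variable, list the cluster pairs whose means differ by more than tol.
    n_vars = len(mu[0])
    pairs = list(combinations(range(len(mu)), 2))
    return {j: [(k, kp) for k, kp in pairs if abs(mu[k][j] - mu[kp][j]) > tol]
            for j in range(n_vars)}

# Toy example: 3 clusters, 3 variables.
mu = [[ 1.0, 0.0,  2.0],
      [ 1.0, 0.0, -1.0],
      [-2.0, 0.0, -1.0]]

print(separable_pairs(mu))
```

In this toy example, variable 1 separates no clusters and would be dropped by any of the approaches. Variables 0 and 2 are both informative, but each leaves one pair of clusters unseparated, which is the information a one-in-all-out rule does not report.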