Maugis Cathy, Celeux Gilles, Martin-Magniette Marie-Laure
Department of Mathematics, University Paris-Sud 11, Orsay, France.
Biometrics. 2009 Sep;65(3):701-9. doi: 10.1111/j.1541-0420.2008.01160.x. Epub 2009 Feb 4.
This article addresses variable selection for cluster analysis. The problem is treated as a model selection problem in the model-based clustering context. A model generalizing that of Raftery and Dean (2006, Journal of the American Statistical Association 101, 168-178) is proposed to specify the role of each variable. This model requires no prior assumption about the linear link between the selected and discarded variables. Models are compared using the Bayesian information criterion (BIC). The role of each variable is determined by an algorithm that embeds two backward stepwise algorithms, one for variable selection in clustering and one for variable selection in linear regression. Identifiability of the model is established, and consistency of the resulting criterion is proved under regularity conditions. Numerical experiments on simulated datasets and a genomic application illustrate the value of the procedure.
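The regression half of the procedure can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Gaussian linear regression with intercept, the convention BIC = 2 log-likelihood minus (number of parameters) times log n (so higher is better), and hypothetical helper names `regression_bic` and `backward_select`. It shows only how a backward stepwise search driven by BIC can pick the subset of selected variables that explains a discarded variable; the clustering half (fitting Gaussian mixtures) is omitted.

```python
import numpy as np


def regression_bic(y, X):
    """BIC (higher is better) of a Gaussian linear regression of y on X.

    X has shape (n, k); k may be 0, in which case only an intercept is fit.
    """
    n = y.shape[0]
    design = np.column_stack([np.ones(n), X]) if X.shape[1] else np.ones((n, 1))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / n  # maximum-likelihood variance estimate
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    n_params = design.shape[1] + 1  # regression coefficients + variance
    return 2 * loglik - n_params * np.log(n)


def backward_select(y, X):
    """Backward stepwise search: drop one regressor at a time while BIC improves."""
    kept = list(range(X.shape[1]))
    best = regression_bic(y, X[:, kept])
    while kept:
        # Score every model obtained by removing a single regressor.
        scores = [
            (regression_bic(y, X[:, [k for k in kept if k != j]]), j)
            for j in kept
        ]
        score, drop = max(scores)
        if score <= best:
            break  # no single removal improves BIC
        best = score
        kept.remove(drop)
    return kept, best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))                      # stand-in "selected" variables
    y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=500)  # stand-in "discarded" variable
    kept, bic = backward_select(y, X)
    print("regressors kept:", kept, "BIC:", round(bic, 1))
```

In this toy setup the discarded variable depends only on the first selected variable, so the search is expected to retain column 0; whether the two noise columns are dropped depends on the sample.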