Scrucca Luca, Raftery Adrian E
Department of Economics, Università degli Studi di Perugia, Via A. Pascoli, 20, 06123 Perugia, Italy, URL: http://www.stat.unipg.it/luca.
Department of Statistics, University of Washington, Box 354320, Seattle, WA 98195-4320, United States of America, URL: http://www.stat.washington.edu/raftery/.
J Stat Softw. 2018 Apr;84. doi: 10.18637/jss.v084.i01. Epub 2018 Apr 17.
Finite mixture modeling provides a framework for cluster analysis based on parsimonious Gaussian mixture models. Variable or feature selection is of particular importance in situations where only a subset of the available variables provides clustering information. Selecting such a subset yields a more parsimonious model, more efficient estimates, a clearer interpretation and, often, improved clustering partitions. This paper describes the R package clustvarsel, which performs subset selection for model-based clustering. An improved version of the Raftery and Dean (2006) methodology is implemented in the new release of the package to find the (locally) optimal subset of variables with group/cluster information in a dataset. Search over the solution space is performed using either a stepwise greedy search or a headlong algorithm. Adjustments for speeding up these algorithms are discussed, as well as a parallel implementation of the stepwise search. Usage of the package is presented through the discussion of several data examples.
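The difference between the two search strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the package's implementation: the function names, the `min_gain` threshold, and the toy score are hypothetical, and the generic `score` callable stands in for the BIC-based comparison used in the Raftery and Dean (2006) methodology. Only forward inclusion steps are shown; removal steps are omitted for brevity.

```python
from typing import Callable, FrozenSet, List, Sequence

# Hypothetical scoring interface: maps a set of variable names to a number,
# where larger means more evidence that the set is useful for clustering.
Score = Callable[[FrozenSet[str]], float]


def greedy_forward(variables: Sequence[str], score: Score,
                   min_gain: float = 0.0) -> List[str]:
    """Stepwise greedy search: score every candidate, add the single best."""
    selected: List[str] = []
    remaining = list(variables)
    current = score(frozenset(selected))
    while remaining:
        # Evaluate the gain from adding each remaining variable.
        gains = [(score(frozenset(selected + [v])) - current, v)
                 for v in remaining]
        best_gain, best_var = max(gains, key=lambda g: g[0])
        if best_gain <= min_gain:
            break  # no candidate improves the score enough
        selected.append(best_var)
        remaining.remove(best_var)
        current += best_gain
    return selected


def headlong_forward(variables: Sequence[str], score: Score,
                     min_gain: float = 0.0) -> List[str]:
    """Headlong search: add the first candidate whose gain clears the threshold."""
    selected: List[str] = []
    remaining = list(variables)
    current = score(frozenset(selected))
    improved = True
    while improved and remaining:
        improved = False
        for v in remaining:
            gain = score(frozenset(selected + [v])) - current
            if gain > min_gain:
                # Accept immediately without scoring the other candidates.
                selected.append(v)
                remaining.remove(v)
                current += gain
                improved = True
                break
    return selected


# Toy additive score (an assumption for illustration only).
weights = {"x1": 5.0, "x2": 3.0, "noise": -1.0}


def toy_score(subset: FrozenSet[str]) -> float:
    return sum(weights[v] for v in subset)


order = ["x2", "noise", "x1"]
greedy_result = greedy_forward(order, toy_score)
headlong_result = headlong_forward(order, toy_score)
```

On this toy score both strategies select the same two variables but in a different order, because the headlong variant accepts `x2` as soon as it is seen rather than scoring every candidate first. That early acceptance is what reduces the number of model fits per step, at the cost of depending on the order in which variables are visited.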