Lo Kenneth, Gottardo Raphael
Department of Microbiology, University of Washington, Seattle, WA, USA.
Stat Comput. 2012 Jan 1;22(1):33-52. doi: 10.1007/s11222-010-9204-1.
Cluster analysis is the automated search for groups of homogeneous observations in a data set. A popular modeling approach for clustering is based on finite normal mixture models, which assume that each cluster is modeled as a multivariate normal distribution. However, the normality assumption that each component is symmetric is often unrealistic. Furthermore, normal mixture models are not robust against outliers; they often require extra components for modeling outliers and/or give a poor representation of the data. To address these issues, we propose a new class of distributions, multivariate t distributions with the Box-Cox transformation, for mixture modeling. This class of distributions generalizes the normal distribution with the more heavy-tailed t distribution, and introduces skewness via the Box-Cox transformation. As a result, this provides a unified framework to simultaneously handle outlier identification and data transformation, two interrelated issues. We describe an Expectation-Maximization algorithm for parameter estimation along with transformation selection. We demonstrate the proposed methodology with three real data sets and simulation studies. Compared with a wealth of approaches including the skew-t mixture model, the proposed t mixture model with the Box-Cox transformation performs favorably in terms of accuracy in the assignment of observations, robustness against model misspecification, and selection of the number of components.
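To make the central idea concrete, here is a minimal univariate sketch of a single component density of the kind the abstract describes: a t distribution evaluated on Box-Cox transformed data, with the Jacobian of the transformation included. It assumes the standard Box-Cox form (y^λ - 1)/λ, with log y at λ = 0; the abstract does not give the paper's exact parameterization, and the function names `boxcox`, `t_logpdf`, and `boxcox_t_logpdf` are hypothetical, chosen for illustration. It does not cover the multivariate case, the mixture weights, or the EM updates.

```python
import math

def boxcox(y, lam):
    """Standard Box-Cox transform of a positive value y (assumed form)."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if abs(lam) < 1e-12:
        return math.log(y)          # limiting case as lam -> 0
    return (y ** lam - 1.0) / lam

def t_logpdf(x, mu, sigma, nu):
    """Log density of a location-scale Student t with nu degrees of freedom."""
    z = (x - mu) / sigma
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - (nu + 1) / 2 * math.log1p(z * z / nu))

def boxcox_t_logpdf(y, mu, sigma, nu, lam):
    """Log density of y when boxcox(y, lam) is modeled as t-distributed.

    The term (lam - 1) * log(y) is the log Jacobian of the transform.
    This sketch ignores the fact that the Box-Cox range is bounded for
    lam != 0, so the density is not exactly normalized in that case.
    """
    return t_logpdf(boxcox(y, lam), mu, sigma, nu) + (lam - 1.0) * math.log(y)
```

In this sketch, small values of `nu` give heavy tails that absorb outliers, while `lam` controls skewness on the original scale; at `lam = 1` the transform is a plain shift and the component reduces to an ordinary t density.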