van Havre Zoé, White Nicole, Rousseau Judith, Mengersen Kerrie
School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia; CEREMADE, Université Paris Dauphine, Paris, France.
School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia.
PLoS One. 2015 Jul 15;10(7):e0131739. doi: 10.1371/journal.pone.0131739. eCollection 2015.
This paper proposes solutions to three issues in the estimation of finite mixture models with an unknown number of components: the non-identifiability induced by overfitting the number of components, the mixing limitations of standard Markov chain Monte Carlo (MCMC) sampling techniques, and the related label switching problem. The number of components is estimated by deliberately overfitting the model, using the Zmix algorithm. Zmix provides a bridge between multidimensional samplers and test-based estimation methods: priors are chosen to encourage the weights of extra groups to approach zero. MCMC sampling is made possible by prior parallel tempering, an extension of parallel tempering. Given a sufficiently large sample size, Zmix can accurately estimate the number of components, the posterior parameter estimates and the allocation probabilities. The results reflect uncertainty in the final model and, from a single run, report the range of candidate models and their respective estimated probabilities. Label switching is resolved with a computationally lightweight method, Zswitch, developed for overfitted mixtures by exploiting the intuitiveness of allocation-based relabelling algorithms and the precision of label-invariant loss functions. Four simulation studies and three case studies from the literature illustrate Zmix and Zswitch. All methods are available in the R package Zmix, which can currently be applied to univariate Gaussian mixture models.
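The idea that a prior can push the weights of surplus components towards zero can be illustrated with a small simulation. The sketch below is not the Zmix implementation and does not use the R package; it assumes a symmetric Dirichlet prior on the mixture weights (a common choice for overfitted mixtures, taken here as an assumption) and only draws from that prior, with no data or likelihood involved. The function names, the number of components `k`, the threshold and the concentration values are all hypothetical choices made for illustration.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one weight vector from a symmetric Dirichlet(alpha, ..., alpha).

    Uses the standard construction: normalise k independent Gamma(alpha, 1) draws.
    """
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(draws)
        if total > 0.0:  # guard against underflow when alpha is very small
            return [d / total for d in draws]


def mean_active_components(alpha, k=10, threshold=0.01, n_draws=2000, seed=1):
    """Average number of weights above `threshold` across prior draws.

    `threshold` is an arbitrary cut-off for calling a component non-negligible.
    """
    rng = random.Random(seed)
    active = 0
    for _ in range(n_draws):
        weights = sample_dirichlet(alpha, k, rng)
        active += sum(w > threshold for w in weights)
    return active / n_draws


if __name__ == "__main__":
    # Smaller concentration values leave fewer components with noticeable weight.
    for alpha in (0.01, 1.0, 10.0):
        print(f"alpha={alpha}: mean active components = "
              f"{mean_active_components(alpha):.2f}")
```

Under these assumptions, a small concentration parameter concentrates almost all prior mass on a few components, whereas a large one spreads the weight nearly evenly across all `k`. In a full sampler the data would counteract this prior pull for the components that are genuinely needed.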