Psychological Methods Group, University of Amsterdam, Amsterdam, Netherlands.
Department of Methodology and Statistics, Tilburg University, Tilburg, Netherlands.
Behav Res Methods. 2023 Jun;55(4):2143-2156. doi: 10.3758/s13428-022-01883-8. Epub 2022 Jul 13.
Gaussian mixture models (GMMs) are a popular and versatile tool for exploring heterogeneity in multivariate continuous data. Arguably the most popular way to estimate GMMs is via the expectation-maximization (EM) algorithm combined with model selection using the Bayesian information criterion (BIC). If the GMM is correctly specified, this estimation procedure has been demonstrated to have high recovery performance. However, in many situations the data are not continuous but ordinal, for example when assessing symptom severity in medical data or modeling the responses in a survey. For such situations, it is unknown how well the EM algorithm and the BIC perform in GMM recovery. In the present paper, we investigate this question by simulating data from various GMMs, thresholding them into ordinal categories, and evaluating recovery performance. We show that the number of components can be estimated reliably if the number of ordinal categories and the number of variables are high enough. However, the estimates of the parameters of the component models are biased regardless of sample size. Finally, we discuss alternative modeling approaches that might be adopted in situations where estimating a GMM is not acceptable.
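The simulate, threshold, and recover procedure described above can be illustrated with a minimal sketch. It assumes NumPy and scikit-learn's `GaussianMixture` (which fits by EM and exposes a `bic` method); this is not the authors' software or simulation design. The component means, identity covariance, sample size, equally spaced thresholds, and the helper names `simulate_gmm`, `to_ordinal`, and `select_k_by_bic` are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def simulate_gmm(n, means, cov, weights, rng):
    """Draw n observations from a GMM whose components share one covariance."""
    labels = rng.choice(len(weights), size=n, p=weights)
    noise = rng.multivariate_normal(np.zeros(means.shape[1]), cov, size=n)
    return means[labels] + noise, labels


def to_ordinal(x, n_categories):
    """Cut each variable into n_categories ordered categories (0-based)."""
    out = np.empty(x.shape, dtype=int)
    for j in range(x.shape[1]):
        col = x[:, j]
        # Assumption: equally spaced interior thresholds between min and max.
        cuts = np.linspace(col.min(), col.max(), n_categories + 1)[1:-1]
        out[:, j] = np.digitize(col, cuts)
    return out


def select_k_by_bic(x, max_k=4, seed=0):
    """Fit GMMs with 1..max_k components by EM; return the K with lowest BIC."""
    bics = []
    for k in range(1, max_k + 1):
        gmm = GaussianMixture(
            n_components=k,
            covariance_type="full",
            n_init=3,          # several EM restarts to avoid poor local optima
            reg_covar=1e-3,    # small ridge keeps covariances non-degenerate
            random_state=seed,
        )
        gmm.fit(x)
        bics.append(gmm.bic(x))  # scikit-learn convention: lower is better
    return int(np.argmin(bics)) + 1, bics


rng = np.random.default_rng(1)

# Hypothetical true model: two well-separated components in five variables.
means = np.array([[-2.0] * 5, [2.0] * 5])
cov = np.eye(5)
x_cont, true_labels = simulate_gmm(2000, means, cov, [0.5, 0.5], rng)

# Threshold the continuous draws into five ordinal categories per variable.
x_ord = to_ordinal(x_cont, n_categories=5).astype(float)

# Compare the number of components selected on continuous vs. ordinal data.
k_cont, bic_cont = select_k_by_bic(x_cont)
k_ord, bic_ord = select_k_by_bic(x_ord)
print("selected K (continuous):", k_cont)
print("selected K (ordinal):   ", k_ord)
```

Varying `n_categories`, the number of variables, and the sample size in a sketch like this is one way to explore how ordinal coarsening affects the selected number of components and the estimated component parameters.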