Gergely Bence, Vargha András
Károli Gáspár University, Budapest, Hungary.
University of Amsterdam, Amsterdam, The Netherlands.
J Pers Oriented Res. 2021 Aug 26;7(1):22-35. doi: 10.17505/jpor.2021.23449. eCollection 2021.
Model-based cluster analysis (MBCA) was created to automate the often subjective model-selection procedure of traditional exploratory clustering methods. It is a type of finite mixture modelling, which assumes that the data come from a mixture of subpopulations, each following a given distribution, typically multivariate normal. Cluster analysis then amounts to exploring the underlying mixture structure. In MBCA, finding the number of clusters and the best clustering model is a statistical model-selection problem in which models with differing numbers and types of component distributions are compared. Each fitted model is evaluated with the likelihood-based Bayesian Information Criterion (BIC), and the model with the highest BIC value is accepted as the final solution. The aim of the present study is to investigate the adequacy of automatic model selection in MBCA using BIC and of suggested alternatives, such as the Integrated Completed Likelihood Criterion (ICL) or Baudry's method. An additional aim is to refine these procedures by using so-called quality coefficients (QCs), borrowed from methodological advances in exploratory cluster analysis, to help in choosing an appropriate cluster structure (CLS), and to compare the efficiency of MBCA in identifying a theoretical CLS with that of various other clustering methods. The analyses are restricted to the performance of these procedures in two classification situations typical of person-oriented studies: (1) three data sets generated from an example data set with a perfect theoretical CLS of seven types (seven completely homogeneous clusters) by adding varying degrees of measurement error to the original values, and (2) three additional data sets based on another perfect theoretical CLS with four types. It was found that the automatic decision rarely led to an optimal solution. However, by dropping solutions with irregular BIC curves and by using different QCs as an aid in choosing among the solutions generated by MBCA and by fusing close clusters, optimal solutions were achieved for both classification situations studied. With this refined procedure, the cluster solutions revealed by MBCA often proved to be at least as good as those of different hierarchical and k-center clustering methods. MBCA was definitely superior in identifying four-type CLS models. In identifying seven-type CLS models, MBCA performed at a level similar to the best of the other clustering methods (such as k-means) only when the reliability of the input variables was high or moderate; otherwise it was slightly less efficient.
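The automatic selection step described above can be illustrated with a minimal sketch. The abstract does not name the software used in the study, so this example assumes scikit-learn's `GaussianMixture` as a stand-in; the function name `select_gmm_by_bic`, the grid of covariance types, and the synthetic four-cluster data are all hypothetical choices made for illustration. Note that scikit-learn reports BIC on a scale where lower is better, so the sketch negates it to match the "highest BIC wins" convention used in the abstract.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def select_gmm_by_bic(X, max_components=6,
                      covariance_types=("spherical", "diag", "full"),
                      seed=0):
    """Fit Gaussian mixtures over a grid of component counts and
    covariance structures; return the best fit and all results."""
    results = []
    for cov in covariance_types:
        for k in range(1, max_components + 1):
            gmm = GaussianMixture(n_components=k, covariance_type=cov,
                                  n_init=5, random_state=seed).fit(X)
            # scikit-learn's bic() is "lower is better"; negate it so that
            # the largest value marks the preferred model.
            results.append((-gmm.bic(X), k, cov, gmm))
    best = max(results, key=lambda r: r[0])
    return best, results


# Synthetic illustration (not the study's data): four well-separated
# clusters in two dimensions with a small amount of added noise.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(200, 2)) for c in centers])

best, results = select_gmm_by_bic(X)
best_bic, k, cov, model = best
print(f"selected {k} components with '{cov}' covariance")
```

Keeping the full `results` list, rather than only the winner, makes it possible to plot the BIC curve for each covariance structure and inspect it for irregularities before accepting the automatic choice, which is the kind of check the refined procedure in the abstract relies on.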