Kasa Siva Rajesh, Rajan Vaibhav
School of Computing, National University of Singapore, COM1, 13, Computing Dr, Singapore, 117417, Singapore.
Sci Rep. 2023 Nov 6;13(1):19164. doi: 10.1038/s41598-023-44608-3.
Clustering is a fundamental tool for exploratory data analysis and is ubiquitous across scientific disciplines. The Gaussian Mixture Model (GMM) is a popular, interpretable probabilistic model for clustering. In many practical settings, the true data distribution is unknown; it may be non-Gaussian and may be contaminated by noise or outliers. In such cases, clustering may still be done with a misspecified GMM, but this may lead to incorrect classification of the underlying subpopulations. In this paper, we identify and characterize the problem of inferior clustering solutions. Like the well-known spurious solutions, these inferior solutions have high likelihood and poor cluster interpretation; however, they differ from spurious solutions in other characteristics, such as asymmetry in the fitted components. We theoretically analyze this asymmetry and its relation to misspecification. We propose a new penalty term designed to avoid both inferior and spurious solutions. Using this penalty term, we develop a new model selection criterion and a new GMM-based clustering algorithm, SIA. We empirically demonstrate that, in cases of misspecification, SIA avoids inferior solutions and outperforms previous GMM-based clustering methods.
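The setting described above, fitting a GMM to data that are not actually Gaussian, can be illustrated with a minimal sketch. This is not the SIA algorithm or the proposed penalty term: it is plain EM for a one-dimensional, two-component GMM, written with the Python standard library only. The function names (`normal_pdf`, `log_likelihood`, `fit_gmm_1d`), the synthetic data (one uniform and one shifted-exponential subpopulation), and the initial means are all illustrative assumptions. For each initialization it prints the final log-likelihood and the ratio of the fitted variances, a crude stand-in for the kind of imbalance between components that the abstract calls asymmetry.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Total log-likelihood of the data under a 1-D Gaussian mixture."""
    total = 0.0
    for x in data:
        density = sum(w * normal_pdf(x, m, v)
                      for w, m, v in zip(weights, means, variances))
        total += math.log(max(density, 1e-300))  # guard against log(0)
    return total


def fit_gmm_1d(data, init_means, n_iter=100, var_floor=1e-6):
    """Plain EM for a 1-D GMM (hypothetical helper, not the paper's method)."""
    n, k = len(data), len(init_means)
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    weights = [1.0 / k] * k
    means = list(init_means)
    variances = [grand_var] * k  # start every component at the overall variance
    history = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = max(sum(dens), 1e-300)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, var_floor)  # avoid degenerate components
        history.append(log_likelihood(data, weights, means, variances))
    return weights, means, variances, history


# Synthetic, deliberately non-Gaussian data (an assumption for illustration):
# one uniform subpopulation and one skewed, shifted-exponential subpopulation.
rng = random.Random(0)
data = [rng.uniform(-3.0, -1.0) for _ in range(150)]
data += [1.0 + rng.expovariate(1.0) for _ in range(150)]

results = []
for init in [(-2.0, 2.0), (0.0, 0.5), (1.0, 6.0)]:
    weights, means, variances, history = fit_gmm_1d(data, init)
    ratio = max(variances) / min(variances)
    results.append((init, history[-1], ratio))
    print(f"init={init}  loglik={history[-1]:.2f}  variance ratio={ratio:.2f}")
```

Comparing the printed lines across initializations shows whether solutions with similar likelihood differ in how unequal their components are. The one-dimensional, standard-library-only form was chosen to keep the sketch self-contained; a practical analysis would more likely use a multivariate implementation.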