Avoiding inferior clusterings with misspecified Gaussian mixture models.

Authors

Kasa Siva Rajesh, Rajan Vaibhav

Affiliation

School of Computing, National University of Singapore, COM1, 13, Computing Dr, Singapore, 117417, Singapore.

Publication

Sci Rep. 2023 Nov 6;13(1):19164. doi: 10.1038/s41598-023-44608-3.

DOI:10.1038/s41598-023-44608-3

PMID:37932317

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10628229/

Abstract

Clustering is a fundamental tool for exploratory data analysis and is ubiquitous across scientific disciplines. The Gaussian Mixture Model (GMM) is a popular probabilistic and interpretable model for clustering. In many practical settings, the true data distribution, which is unknown, may be non-Gaussian and may be contaminated by noise or outliers. In such cases, clustering may still be done with a misspecified GMM. However, this may lead to incorrect classification of the underlying subpopulations. In this paper, we identify and characterize the problem of inferior clustering solutions. Similar to well-known spurious solutions, these inferior solutions have high likelihood and poor cluster interpretation; however, they differ from spurious solutions in other characteristics, such as asymmetry in the fitted components. We theoretically analyze this asymmetry and its relation to misspecification. We propose a new penalty term that is designed to avoid both inferior and spurious solutions. Using this penalty term, we develop a new model selection criterion and a new GMM-based clustering algorithm, SIA. We empirically demonstrate that, in cases of misspecification, SIA avoids inferior solutions and outperforms previous GMM-based clustering methods.
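To make the setting concrete, the following is a minimal sketch of fitting a misspecified GMM: plain EM on one-dimensional data made of two Gaussian clusters contaminated by uniform outliers, run from several random initialisations. It is not SIA and does not implement the paper's penalty term or model selection criterion, which the abstract does not specify. All names here (`fit_gmm_1d`, `e_step`, `var_floor`), the synthetic data, and the variance floor are assumptions chosen for illustration.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def e_step(data, weights, means, variances):
    """Return responsibilities and total log-likelihood (log-sum-exp for stability)."""
    resp, log_lik = [], 0.0
    for x in data:
        logs = [math.log(w) + log_normal_pdf(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        top = max(logs)
        norm = top + math.log(sum(math.exp(l - top) for l in logs))
        resp.append([math.exp(l - norm) for l in logs])
        log_lik += norm
    return resp, log_lik


def fit_gmm_1d(data, k, rng, n_iter=100, var_floor=1e-3):
    """Plain EM for a k-component 1-D GMM (hypothetical helper, not the paper's method)."""
    means = rng.sample(data, k)          # random initial means drawn from the data
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        resp, _ll = e_step(data, weights, means, variances)
        # Soft counts per component, guarded against empty components.
        mass = [max(sum(r[j] for r in resp), 1e-12) for j in range(k)]
        means = [sum(r[j] * x for r, x in zip(resp, data)) / mass[j]
                 for j in range(k)]
        # The variance floor is an assumption here; it keeps a component from
        # collapsing onto a single point.
        variances = [max(sum(r[j] * (x - means[j]) ** 2
                             for r, x in zip(resp, data)) / mass[j], var_floor)
                     for j in range(k)]
        total = sum(mass)
        weights = [m / total for m in mass]
    resp, log_lik = e_step(data, weights, means, variances)
    return weights, means, variances, log_lik, resp


# Synthetic data: two Gaussian clusters plus uniform outliers (label None).
rng = random.Random(0)
data, labels = [], []
for label, centre in enumerate((-3.0, 3.0)):
    for _ in range(100):
        data.append(rng.gauss(centre, 1.0))
        labels.append(label)
for _ in range(10):
    data.append(rng.uniform(-20.0, 20.0))
    labels.append(None)

results = []
for init in range(5):
    weights, means, variances, log_lik, resp = fit_gmm_1d(data, 2, rng)
    pred = [0 if r[0] >= r[1] else 1 for r in resp]
    pairs = [(p, t) for p, t in zip(pred, labels) if t is not None]
    agree = sum(p == t for p, t in pairs) / len(pairs)
    accuracy = max(agree, 1.0 - agree)   # cluster labels are arbitrary, so allow a swap
    results.append((log_lik, accuracy, weights))
    print(f"init {init}: log-likelihood={log_lik:.2f}, "
          f"inlier accuracy={accuracy:.3f}, variances={variances}")
```

Comparing the printed log-likelihood, inlier accuracy, and fitted variances across initialisations is one way to check whether a run with higher likelihood also recovers the two clusters better; under contamination this is not guaranteed, which is the kind of mismatch the abstract describes.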
