Summarizing Finite Mixture Model with Overlapping Quantification

Authors

Kyoya Shunki, Yamanishi Kenji

Affiliation

Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan.

Publication

Entropy (Basel). 2021 Nov 13;23(11):1503. doi: 10.3390/e23111503.

DOI:10.3390/e23111503

PMID:34828201

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8622449/

Abstract

Finite mixture models are widely used for modeling and clustering data. When used for clustering, they are usually interpreted by regarding each component as one cluster. However, this interpretation may be invalid when components overlap, which raises the problem of analyzing such overlaps in order to understand the models correctly. The primary purpose of this paper is to establish a theoretical framework for interpreting overlapping mixture models by estimating how they overlap, using information measures such as entropy and mutual information. This is achieved by merging components, so that multiple components are regarded as one cluster, and by summarizing the merging results. First, we propose three conditions that any merging criterion should satisfy. We then investigate whether several existing merging criteria satisfy these conditions and modify them to fulfill more of the conditions. Second, we propose a novel concept named clustering summarization to evaluate the merging results. With it, we can quantify how overlapped and biased the clusters are, using mutual information-based criteria. Using artificial and real datasets, we empirically demonstrate that our methods of modifying criteria and summarizing results are effective for understanding the cluster structures. We thereby give a new view of interpretability/explainability for model-based clustering.
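To make the idea of measuring overlap with entropy and mutual information concrete, the following is a minimal sketch in Python. It is not the paper's merging criteria or its clustering summarization. It only illustrates a generic quantity: an estimate of the mutual information I(X; Z) = H(Z) - H(Z | X) between the data X and the latent component label Z of a one-dimensional Gaussian mixture, computed from posterior responsibilities. All function names, the mixture parameters, and the sample size are assumptions chosen for illustration.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def responsibilities(x, weights, means, stds):
    """Posterior probability of each component given the point x."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sample_mixture(n, weights, means, stds, rng):
    """Draw n points from the mixture (hypothetical helper for the demo)."""
    xs = []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        xs.append(rng.gauss(means[k], stds[k]))
    return xs


def overlap_summary(xs, weights, means, stds):
    """Return (H(Z), H(Z|X), I(X;Z)) estimated from the sample xs."""
    resp = [responsibilities(x, weights, means, stds) for x in xs]
    n = len(resp)
    # Marginal label distribution: average responsibility per component.
    marginal = [sum(r[j] for r in resp) / n for j in range(len(weights))]
    h_z = entropy(marginal)
    # Conditional entropy: average uncertainty of the label given each point.
    h_z_given_x = sum(entropy(r) for r in resp) / n
    return h_z, h_z_given_x, h_z - h_z_given_x


rng = random.Random(0)
weights = [0.5, 0.5]
stds = [1.0, 1.0]

# Two well-separated components versus two heavily overlapping ones.
separated = overlap_summary(
    sample_mixture(2000, weights, [0.0, 10.0], stds, rng), weights, [0.0, 10.0], stds
)
overlapped = overlap_summary(
    sample_mixture(2000, weights, [0.0, 0.5], stds, rng), weights, [0.0, 0.5], stds
)

print("separated  (H(Z), H(Z|X), I):", separated)
print("overlapped (H(Z), H(Z|X), I):", overlapped)
```

In this sketch, the estimated mutual information for the well-separated mixture comes out close to its upper bound of log 2 nats, so the two components behave like two distinct clusters. For the heavily overlapping mixture it comes out close to zero, which suggests that treating the two components as a single cluster may be the more faithful reading.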
