Escaping the Curse of Dimensionality in Bayesian Model-Based Clustering.
Author information
Chandra Noirrit Kiran, Canale Antonio, Dunson David B
Affiliations
Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, TX, USA.
Department of Statistical Sciences, University of Padova, Padova, Italy.
Publication information
J Mach Learn Res. 2023 Apr;24.
Bayesian mixture models are widely used for clustering of high-dimensional data with appropriate uncertainty quantification. However, as the dimension of the observations increases, posterior inference often tends to favor too many or too few clusters. This article explains this behavior by studying the random partition posterior in a non-standard setting with a fixed sample size and increasing data dimensionality. We provide conditions under which the finite sample posterior tends to either assign every observation to a different cluster or all observations to the same cluster as the dimension grows. Interestingly, the conditions do not depend on the choice of clustering prior, as long as all possible partitions of observations into clusters have positive prior probabilities, and hold irrespective of the true data-generating model. We then propose a class of latent mixtures for Bayesian clustering (Lamb) on a set of low-dimensional latent variables inducing a partition on the observed data. The model is amenable to scalable posterior inference and we show that it can avoid the pitfalls of high-dimensionality under mild assumptions. The proposed approach is shown to have good performance in simulation studies and an application to inferring cell types based on scRNAseq.
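The latent-mixture idea in the abstract (clusters defined on low-dimensional latent variables, which then induce a partition on the high-dimensional observations) can be illustrated with a small simulation. This is only a sketch of the general structure, not the authors' Lamb implementation: the function name simulate_latent_mixture, the Gaussian cluster centers, the Gaussian loading matrix, and every parameter value are assumptions chosen for illustration, and no posterior inference is performed.

```python
import random


def simulate_latent_mixture(n=30, p=200, d=2, k=3, noise_sd=1.0, seed=0):
    """Simulate n observations of dimension p whose cluster structure
    lives entirely in a d-dimensional latent space (d much smaller than p).

    All distributional choices here are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Hypothetical cluster centers in the low-dimensional latent space.
    centers = [[rng.gauss(0.0, 4.0) for _ in range(d)] for _ in range(k)]
    # Hypothetical p x d loading matrix mapping latent variables to observations.
    loadings = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(p)]

    labels, eta, y = [], [], []
    for _ in range(n):
        c = rng.randrange(k)  # cluster label of this observation
        # Latent variable: cluster center plus unit-variance Gaussian noise.
        latent = [centers[c][j] + rng.gauss(0.0, 1.0) for j in range(d)]
        # Observation: linear map of the latent variable plus independent noise.
        obs = [
            sum(loadings[r][j] * latent[j] for j in range(d))
            + rng.gauss(0.0, noise_sd)
            for r in range(p)
        ]
        labels.append(c)
        eta.append(latent)
        y.append(obs)
    return y, eta, labels


if __name__ == "__main__":
    y, eta, labels = simulate_latent_mixture()
    print(len(y), len(y[0]), len(eta[0]), sorted(set(labels)))
```

In this sketch the partition of the observations y is determined by the labels attached to the latent variables eta, so a clustering model only needs to operate in d dimensions regardless of how large p becomes.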