Escaping The Curse of Dimensionality in Bayesian Model-Based Clustering.

Author information

Chandra Noirrit Kiran, Canale Antonio, Dunson David B

Affiliations

Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, TX, USA.

Department of Statistical Sciences, University of Padova, Padova, Italy.

Publication information

J Mach Learn Res. 2023 Apr;24.

PMID: 40236516

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11999651/

Abstract

Bayesian mixture models are widely used for clustering of high-dimensional data with appropriate uncertainty quantification. However, as the dimension of the observations increases, posterior inference often tends to favor too many or too few clusters. This article explains this behavior by studying the random partition posterior in a non-standard setting with a fixed sample size and increasing data dimensionality. We provide conditions under which the finite sample posterior tends to either assign every observation to a different cluster or all observations to the same cluster as the dimension grows. Interestingly, the conditions do not depend on the choice of clustering prior, as long as all possible partitions of observations into clusters have positive prior probabilities, and hold irrespective of the true data-generating model. We then propose a class of latent mixtures for Bayesian clustering (Lamb) on a set of low-dimensional latent variables inducing a partition on the observed data. The model is amenable to scalable posterior inference and we show that it can avoid the pitfalls of high-dimensionality under mild assumptions. The proposed approach is shown to have good performance in simulation studies and an application to inferring cell types based on scRNAseq.
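
The mechanism described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's model or its conditions: it assumes a spherical Gaussian kernel with known variance `sigma2`, a N(0, `tau2`) prior on each cluster mean, independent coordinates, and unstructured Gaussian data. All function names and parameter values are hypothetical. It compares only the two extreme partitions (one cluster versus n singletons) and shows that the log marginal likelihood ratio between them accumulates across dimensions, so with the sample size held fixed its magnitude grows roughly linearly in p, and any fixed prior log-odds between the two partitions is eventually swamped.

```python
import math
import random


def cluster_log_marginal(values, sigma2=1.0, tau2=1.0):
    """Log marginal density of one cluster's values in a single dimension.

    Assumed toy model: y_i = mu + e_i, mu ~ N(0, tau2), e_i ~ N(0, sigma2),
    so the cluster is jointly N(0, sigma2 * I + tau2 * J), J the all-ones matrix.
    """
    m = len(values)
    s = sum(values)
    ss = sum(v * v for v in values)
    # Eigenvalues of the covariance: sigma2 (m - 1 times) and sigma2 + m * tau2.
    log_det = (m - 1) * math.log(sigma2) + math.log(sigma2 + m * tau2)
    # Quadratic form via the Sherman-Morrison inverse of sigma2 * I + tau2 * J.
    quad = (ss - tau2 * s * s / (sigma2 + m * tau2)) / sigma2
    return -0.5 * (m * math.log(2 * math.pi) + log_det + quad)


def log_ratio_one_vs_singletons(data):
    """log p(data | one cluster) - log p(data | n singleton clusters).

    data is a list of n observations, each a list of p coordinates.
    Coordinates are independent here, so log marginals add over dimensions.
    """
    n, p = len(data), len(data[0])
    total = 0.0
    for j in range(p):
        column = [data[i][j] for i in range(n)]
        one = cluster_log_marginal(column)
        singles = sum(cluster_log_marginal([v]) for v in column)
        total += one - singles
    return total


def simulate(n, p, data_var, seed=0):
    """Draw n unstructured observations in p dimensions, i.i.d. N(0, data_var)."""
    rng = random.Random(seed)
    sd = math.sqrt(data_var)
    return [[rng.gauss(0.0, sd) for _ in range(p)] for _ in range(n)]


if __name__ == "__main__":
    n = 20  # sample size stays fixed while the dimension p grows
    for data_var in (1.0, 2.0):
        for p in (10, 100, 1000):
            ratio = log_ratio_one_vs_singletons(simulate(n, p, data_var))
            print(f"data_var={data_var} p={p:5d} log ratio={ratio:10.1f}")
```

Under these assumptions, the sign of the ratio depends on how the spread of the data compares with the assumed kernel and prior variances: data with variance 1 push toward a single cluster, and data with variance 2 push toward singletons. In both cases the preference sharpens as p increases, even though the data contain no cluster structure at all.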
