Bayesian Modeling of Sequential Discoveries.

Author information

Zito Alessandro, Rigon Tommaso, Ovaskainen Otso, Dunson David B

Affiliations

Department of Statistical Science, Duke University, Durham, NC.

Department of Economics, Management and Statistics, University of Milano-Bicocca, Milan, Italy.

Publication information

J Am Stat Assoc. 2023;118(544):2521-2532. doi: 10.1080/01621459.2022.2060835. Epub 2022 May 13.

DOI:10.1080/01621459.2022.2060835

PMID:38501061

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10947068/

Abstract

We aim at modeling the appearance of distinct tags in a sequence of labeled objects. Common examples of this type of data include words in a corpus or distinct species in a sample. These sequential discoveries are often summarized via accumulation curves, which count the number of distinct entities observed in an increasingly large set of objects. We propose a novel Bayesian method for species sampling modeling by directly specifying the probability of a new discovery, therefore, allowing for flexible specifications. The asymptotic behavior and finite sample properties of such an approach are extensively studied. Interestingly, our enlarged class of sequential processes includes highly tractable special cases. We present a subclass of models characterized by appealing theoretical and computational properties, including one that shares the same discovery probability with the Dirichlet process. Moreover, due to strong connections with logistic regression models, the latter subclass can naturally account for covariates. We finally test our proposal on both synthetic and real data, with special emphasis on a large fungal biodiversity study in Finland. Supplementary materials for this article are available online.
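As a rough illustration of the idea of specifying the discovery probability directly, the sketch below simulates an accumulation curve in which the object observed after n earlier objects carries a new tag with probability alpha / (alpha + n), the standard discovery probability of a Dirichlet process with concentration alpha. It assumes independent Bernoulli discovery indicators; the function names, the value of alpha, and the sample size are illustrative choices and are not taken from the paper.

```python
import random


def discovery_probability(n_seen, alpha):
    """Probability that the next object shows a new tag after n_seen objects.

    Uses the Dirichlet process form alpha / (alpha + n_seen); alpha > 0 is a
    hypothetical concentration parameter chosen for illustration.
    """
    return alpha / (alpha + n_seen)


def simulate_accumulation_curve(n_objects, alpha, seed=0):
    """Simulate the number of distinct tags after each of n_objects draws.

    Assumption: discovery indicators are independent Bernoulli variables.
    """
    rng = random.Random(seed)
    curve = []
    distinct = 0
    for n_seen in range(n_objects):
        if rng.random() < discovery_probability(n_seen, alpha):
            distinct += 1
        curve.append(distinct)
    return curve


def expected_distinct(n_objects, alpha):
    """Expected number of distinct tags: the sum of discovery probabilities."""
    return sum(discovery_probability(i, alpha) for i in range(n_objects))


curve = simulate_accumulation_curve(1000, alpha=5.0)
print(curve[-1], round(expected_distinct(1000, 5.0), 2))
```

The first object is always a discovery because the probability equals 1 when nothing has been seen yet, and the curve flattens as n grows because the probability decays like alpha / n.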

