Zito Alessandro, Rigon Tommaso, Ovaskainen Otso, Dunson David B
Department of Statistical Science, Duke University, Durham, NC.
Department of Economics, Management and Statistics, University of Milano-Bicocca, Milan, Italy.
J Am Stat Assoc. 2023;118(544):2521-2532. doi: 10.1080/01621459.2022.2060835. Epub 2022 May 13.
We aim to model the appearance of distinct tags in a sequence of labeled objects. Common examples of this type of data include words in a corpus or distinct species in a sample. These sequential discoveries are often summarized via accumulation curves, which count the number of distinct entities observed in an increasingly large set of objects. We propose a novel Bayesian method for species sampling modeling that directly specifies the probability of a new discovery, thereby allowing for flexible specifications. The asymptotic behavior and finite sample properties of this approach are extensively studied. Interestingly, our enlarged class of sequential processes includes highly tractable special cases. We present a subclass of models characterized by appealing theoretical and computational properties, including one that shares the same discovery probability with the Dirichlet process. Moreover, due to strong connections with logistic regression models, this subclass can naturally account for covariates. Finally, we test our proposal on both synthetic and real data, with special emphasis on a large fungal biodiversity study in Finland. Supplementary materials for this article are available online.
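To make the idea of an accumulation curve driven by a discovery probability concrete, here is a minimal sketch, not the authors' implementation. It assumes the standard Dirichlet process discovery probability, under which the object observed after n previous objects carries a new tag with probability alpha / (alpha + n), and it draws each discovery indicator independently. The function names `simulate_accumulation_curve` and `expected_distinct` and the parameter `alpha` are hypothetical names chosen for illustration.

```python
import random


def simulate_accumulation_curve(n_objects, alpha, seed=0):
    """Simulate the number of distinct tags seen after each object.

    Assumption (illustrative only): the object arriving after n previous
    ones is a new discovery with probability alpha / (alpha + n), the
    Dirichlet process discovery probability, with alpha > 0.
    """
    rng = random.Random(seed)
    curve = []
    distinct = 0
    for n in range(n_objects):
        p_new = alpha / (alpha + n)  # equals 1 for the very first object
        if rng.random() < p_new:
            distinct += 1
        curve.append(distinct)
    return curve


def expected_distinct(n_objects, alpha):
    """Expected number of distinct tags: the sum of discovery probabilities."""
    return sum(alpha / (alpha + n) for n in range(n_objects))


if __name__ == "__main__":
    curve = simulate_accumulation_curve(1000, alpha=5.0)
    print("distinct tags after 1000 objects:", curve[-1])
    print("expected value:", round(expected_distinct(1000, 5.0), 2))
```

Note that alpha / (alpha + n) can be rewritten as 1 / (1 + exp(log n - log alpha)), a logistic function of log n. This is one way to see how a discovery probability of this form could be tied to logistic regression, although the exact covariate construction used in the article is not reproduced here.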