Ni Yang, Ji Yuan, Müller Peter
Department of Statistics, Texas A&M University.
Department of Public Health Sciences, The University of Chicago.
J Comput Graph Stat. 2020;29(4):703-714. doi: 10.1080/10618600.2020.1737085. Epub 2020 Apr 15.
We present a consensus Monte Carlo algorithm that scales existing Bayesian nonparametric models for clustering and feature allocation to big data. The algorithm is valid for any prior on random subsets, such as partitions and latent feature allocations, under essentially any sampling model. Motivated by three case studies, we focus on clustering induced by a Dirichlet process mixture sampling model, inference under an Indian buffet process prior with a binomial sampling model, and inference with a categorical sampling model. We assess the proposed algorithm with simulation studies and show results for inference with three datasets: an MNIST image dataset, a dataset of pancreatic cancer mutations, and a large set of electronic health records (EHR). Supplementary materials for this article are available online.