A Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets.

Author information

Greg Ridgeway, David Madigan

Affiliations

RAND, PO Box 2138, Santa Monica, CA 90407-2138

Publication information

Data Min Knowl Discov. 2003 Jul 1;7(3):301-319. doi: 10.1023/A:1024084221803.

DOI:10.1023/A:1024084221803

PMID:19789656

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2753529/

Abstract

Markov chain Monte Carlo (MCMC) techniques revolutionized statistical practice in the 1990s by providing an essential toolkit for making the rigor and flexibility of Bayesian analysis computationally practical. At the same time, the increasing prevalence of massive datasets and the expansion of the field of data mining have created the need for statistically sound methods that scale to these large problems. Except for the most trivial examples, current MCMC methods require a complete scan of the dataset for each iteration, which eliminates them as feasible data mining techniques.

In this article we present a method for making Bayesian analysis of massive datasets computationally feasible. The algorithm simulates from a posterior distribution that conditions on a smaller, more manageable portion of the dataset. The remainder of the dataset may be incorporated by reweighting the initial draws using importance sampling. Computation of the importance weights requires a single scan of the remaining observations. While importance sampling increases efficiency in data access, it comes at the expense of estimation efficiency. A simple modification, based on the "rejuvenation" step used in particle filters for dynamic systems models, sidesteps the loss of efficiency with only a slight increase in the number of data accesses.

To show proof of concept, we demonstrate the method on two examples. The first is a mixture of transition models that has been used to model web traffic and robotics. For this example we show that estimation efficiency is not affected while the method offers a 99% reduction in data accesses. The second example applies the method to Bayesian logistic regression and yields a 98% reduction in data accesses.
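The reweighting idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a toy model (normal data with known unit variance and a normal prior on the mean) so that the posterior given the subset can be sampled exactly in place of MCMC, and it omits the rejuvenation step. All names (`normal_posterior`, `reweight_by_single_scan`) and all sizes are hypothetical choices made for illustration. Because the observations are conditionally independent, the importance weight of each draw is proportional to the likelihood of the remaining observations, which can be accumulated in one pass over them.

```python
import math
import random


def normal_posterior(xs, prior_var=100.0):
    """Exact posterior (mean, variance) of theta for x ~ N(theta, 1), theta ~ N(0, prior_var)."""
    precision = 1.0 / prior_var + len(xs)
    return sum(xs) / precision, 1.0 / precision


def reweight_by_single_scan(draws, remaining):
    """Return normalized importance weights after one pass over the remaining data.

    Each draw's log weight is the log-likelihood of the remaining observations,
    up to an additive constant that cancels on normalization.
    """
    log_w = [0.0] * len(draws)
    for x in remaining:  # the single scan: each observation is read once
        log_w = [lw - 0.5 * (x - th) ** 2 for lw, th in zip(log_w, draws)]
    top = max(log_w)  # subtract the max before exponentiating for stability
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]


rng = random.Random(0)
data = [rng.gauss(1.5, 1.0) for _ in range(3000)]  # simulated stand-in dataset
subset, remaining = data[:600], data[600:]

# Step 1: draw from the posterior conditioned on the small subset only
# (exact sampling here; in general this would be an MCMC run).
sub_mean, sub_var = normal_posterior(subset)
draws = [rng.gauss(sub_mean, math.sqrt(sub_var)) for _ in range(1000)]

# Step 2: fold in the remaining observations by importance reweighting.
weights = reweight_by_single_scan(draws, remaining)

is_estimate = sum(w * th for w, th in zip(weights, draws))
ess = 1.0 / sum(w * w for w in weights)  # effective sample size of the weighted draws
exact_mean, _ = normal_posterior(data)

print(f"weighted estimate {is_estimate:.4f}, exact posterior mean {exact_mean:.4f}")
print(f"effective sample size {ess:.1f} of {len(draws)} draws")
```

The effective sample size printed at the end falls below the number of draws, which is the loss of estimation efficiency the abstract mentions; the rejuvenation step described there is the paper's remedy and is not shown in this sketch.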
