Liang Faming, Kim Jinsu, Song Qifan
Professor, Department of Biostatistics, University of Florida, Gainesville, FL 32611.
Enterprise Solution Specialist, Big Data Analysis Consulting Team, LG CNS, Seoul, Republic of Korea.
Technometrics. 2016;58(3):604-318. doi: 10.1080/00401706.2016.1142905. Epub 2016 Jul 8.
Markov chain Monte Carlo (MCMC) methods have proven to be a very powerful tool for analyzing data of complex structures. However, their computer-intensive nature, which typically require a large number of iterations and a complete scan of the full dataset for each iteration, precludes their use for big data analysis. In this paper, we propose the so-called bootstrap Metropolis-Hastings (BMH) algorithm, which provides a general framework for how to tame powerful MCMC methods to be used for big data analysis; that is to replace the full data log-likelihood by a Monte Carlo average of the log-likelihoods that are calculated in parallel from multiple bootstrap samples. The BMH algorithm possesses an embarrassingly parallel structure and avoids repeated scans of the full dataset in iterations, and is thus feasible for big data problems. Compared to the popular divide-and-combine method, BMH can be generally more efficient as it can asymptotically integrate the whole data information into a single simulation run. The BMH algorithm is very flexible. Like the Metropolis-Hastings algorithm, it can serve as a basic building block for developing advanced MCMC algorithms that are feasible for big data problems. This is illustrated in the paper by the tempering BMH algorithm, which can be viewed as a combination of parallel tempering and the BMH algorithm. BMH can also be used for model selection and optimization by combining with reversible jump MCMC and simulated annealing, respectively.
马尔可夫链蒙特卡罗(MCMC)方法已被证明是分析复杂结构数据的一种非常强大的工具。然而,其计算密集型的特性,通常每次迭代都需要大量的迭代次数和对整个数据集进行完整扫描,这使得它们无法用于大数据分析。在本文中,我们提出了所谓的自助梅特罗波利斯 - 黑斯廷斯(BMH)算法,该算法提供了一个通用框架,用于说明如何驾驭强大的MCMC方法以用于大数据分析;即通过对数似然的蒙特卡罗平均值来代替全数据对数似然,该对数似然是从多个自助样本并行计算得到的。BMH算法具有易于并行化的结构,并且避免了在迭代中对整个数据集进行重复扫描,因此对于大数据问题是可行的。与流行的分而组合方法相比,BMH通常效率更高,因为它可以渐近地将整个数据信息整合到一次模拟运行中。BMH算法非常灵活。与梅特罗波利斯 - 黑斯廷斯算法一样,它可以作为开发适用于大数据问题的高级MCMC算法的基本构建块。本文通过回火BMH算法对此进行了说明,该算法可以看作是并行回火和BMH算法的结合。BMH还可以分别与可逆跳跃MCMC和模拟退火相结合,用于模型选择和优化。