Bradley Jonathan R
Department of Statistics, Florida State University, 117 N. Woodward Ave., Tallahassee, FL 32306-4330.
J Comput Graph Stat. 2021;30(4):889-905. doi: 10.1080/10618600.2021.1923518. Epub 2021 Jun 21.
The goal of this paper is to give Bayesian statisticians a way to incorporate subsampling directly into the Bayesian hierarchical model of their choosing without imposing additional restrictive model assumptions. We are motivated by the fact that the rise of "big data" has made it difficult for statisticians to apply their methods directly to large datasets. We add a "data subset model" to the popular "data model, process model, and parameter model" framework used to summarize Bayesian hierarchical models. The hyperparameters of the data subset model are specified constructively: they are chosen so that the implied size of the subset satisfies predefined computational constraints. These hyperparameters therefore calibrate the statistical model to the computer itself, so that predictions and estimates are obtained in a prespecified amount of time. Several properties of the data subset model are established, including propriety, partial sufficiency, and semiparametric properties. Simulated datasets are used to assess the consequences of subsampling, and results are presented across different computers to show the effect of the computer on the statistical analysis. Additionally, we provide a joint analysis of a high-dimensional dataset (roughly 10 gigabytes) consisting of the 2018 five-year period estimates from the US Census Bureau's Public Use Microdata Sample (PUMS).
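The idea of choosing the subset size to fit a computational budget can be illustrated with a minimal sketch. It is not the paper's data subset model. It assumes, purely for illustration, that fitting time grows roughly linearly in the number of observations and that the subset is a simple random sample. The function names (`fit_linear_cost`, `subset_size_for_budget`, `draw_subset`) and the pilot timings are hypothetical.

```python
import random


def fit_linear_cost(sizes, seconds):
    """Least-squares fit of seconds ~ a + b * size from pilot timings.

    Assumption: run time is approximately linear in the subset size.
    """
    k = len(sizes)
    mean_x = sum(sizes) / k
    mean_y = sum(seconds) / k
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, seconds))
    b = sxy / sxx            # estimated seconds per extra observation
    a = mean_y - b * mean_x  # estimated fixed overhead in seconds
    return a, b


def subset_size_for_budget(a, b, budget_seconds, n_total):
    """Largest n in [0, n_total] with predicted time a + b * n <= budget."""
    if b <= 0:
        # No measurable per-observation cost: use the full dataset.
        return n_total
    n = int((budget_seconds - a) // b)
    return max(0, min(n_total, n))


def draw_subset(n_total, n, seed=0):
    """Indices of a simple random sample of size n, without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), n))


# Hypothetical pilot timings measured on the machine that will run the analysis.
a, b = fit_linear_cost(sizes=[4, 8, 12], seconds=[1.5, 2.5, 3.5])
n = subset_size_for_budget(a, b, budget_seconds=10.0, n_total=1000)
indices = draw_subset(n_total=1000, n=n)
```

Because the pilot timings are taken on the target machine, the same time budget yields a different subset size on a faster or slower computer, which is the sense in which the subset size is calibrated to the hardware.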