An Approach to Incorporate Subsampling into a Generic Bayesian Hierarchical Model.

Author information

Bradley Jonathan R

Affiliation

Department of Statistics, Florida State University, 117 N. Woodward Ave., Tallahassee, FL 32306-4330.

Publication information

J Comput Graph Stat. 2021;30(4):889-905. doi: 10.1080/10618600.2021.1923518. Epub 2021 Jun 21.

DOI:10.1080/10618600.2021.1923518

PMID:37138786

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC10153567/

Abstract

The goal of this paper is to provide a way for Bayesian statisticians to incorporate subsampling directly into the Bayesian hierarchical model of their choosing without imposing additional restrictive model assumptions. We are motivated by the fact that the rise of "big data" has created difficulties for statisticians to directly apply their methods to big datasets. We introduce a "data subset model" to the popular "data model, process model, and parameter model" framework used to summarize Bayesian hierarchical models. The hyperparameters of the data subset model are specified constructively in that they are chosen such that the implied size of the subset satisfies pre-defined computational constraints. Thus, these hyperparameters effectively calibrate the statistical model to the computer itself to obtain predictions/estimations in a pre-specified amount of time. Several properties of the data subset model are provided including: propriety, partial sufficiency, and semi-parametric properties. Simulated datasets will be used to assess the consequences of subsampling, and results will be presented across different computers to show the effect of the computer on the statistical analysis. Additionally, we provide a joint analysis of a high-dimensional dataset (roughly 10 gigabytes) consisting of 2018 5-year period estimates from the US Census Bureau's Public Use Micro-Sample (PUMS).

Similar articles

Parametric and nonparametric population methods: their comparative performance in analysing a clinical dataset and two Monte Carlo simulation studies.

Clin Pharmacokinet. 2006;45(4):365-83. doi: 10.2165/00003088-200645040-00003.

parallelMCMCcombine: an R package for Bayesian methods for big data and analytics.

PLoS One. 2014 Sep 26;9(9):e108425. doi: 10.1371/journal.pone.0108425. eCollection 2014.

Genomic prediction using subsampling.

BMC Bioinformatics. 2017 Mar 24;18(1):191. doi: 10.1186/s12859-017-1582-3.

Part 2. Development of Enhanced Statistical Methods for Assessing Health Effects Associated with an Unknown Number of Major Sources of Multiple Air Pollutants.

Res Rep Health Eff Inst. 2015 Jun(183 Pt 1-2):51-113.

Empirical Bayesian Analysis Through the Lens of a Particular Class of Constrained Bayesian Hierarchical Models.

Stat. 2021 Dec;10(1). doi: 10.1002/sta4.403. Epub 2021 Jul 5.

Bayesian regression on non-parametric mixed-effect models with shape-restricted Bernstein polynomials.

J Appl Stat. 2016 Feb 17;43(14):2524-2537. doi: 10.1080/02664763.2016.1142940. eCollection 2016.

Sampling Strategies for Fast Updating of Gaussian Markov Random Fields.

Am Stat. 2021;75(1):52-65. doi: 10.1080/00031305.2019.1595144. Epub 2019 May 31.

Kernel-imbedded Gaussian processes for disease classification using microarray gene expression data.

BMC Bioinformatics. 2007 Feb 28;8:67. doi: 10.1186/1471-2105-8-67.

A Bayesian Approach for Summarizing and Modeling Time-Series Exposure Data with Left Censoring.

Ann Work Expo Health. 2017 Aug 1;61(7):773-783. doi: 10.1093/annweh/wxx046.
