Bag of little bootstraps for massive and distributed longitudinal data.

Authors

Zhou Xinkai, Zhou Jin J, Zhou Hua

Affiliations

Department of Biostatistics, University of California, Los Angeles, California, USA.

Department of Medicine, University of California, Los Angeles, California, USA.

Publication information

Stat Anal Data Min. 2022 Jun;15(3):314-321. doi: 10.1002/sam.11563. Epub 2021 Nov 22.

DOI:10.1002/sam.11563

PMID:35656342

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9159544/

Abstract

Linear mixed models are widely used for analyzing longitudinal datasets, and inference for variance component parameters relies on the bootstrap method. However, health systems and technology companies routinely generate massive longitudinal datasets that make the traditional bootstrap method infeasible. To solve this problem, we extend the highly scalable bag of little bootstraps method for independent data to longitudinal data and develop a highly efficient Julia package, MixedModelsBLB.jl. Simulation experiments and real data analysis demonstrate the favorable statistical performance and computational advantages of our method compared to the traditional bootstrap method. For the statistical inference of variance components, it achieves a 200-fold speedup at the scale of 1 million subjects (20 million total observations), and it is the only currently available tool that can handle more than 10 million subjects (200 million total observations) on desktop computers.
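To make the idea concrete, here is a minimal sketch of the generic bag of little bootstraps for independent data, which is the method the abstract says is being extended. It is not the MixedModelsBLB.jl implementation and does not fit a linear mixed model: it assumes each subject has already been reduced to a single number (for example, a per-subject mean) and builds a percentile confidence interval for the mean of those numbers. The function name `blb_mean_ci` and all of its parameters are hypothetical, chosen for illustration only.

```python
import random
import statistics


def blb_mean_ci(values, n_subsets=5, gamma=0.7, n_boot=100, alpha=0.05, seed=0):
    """Illustrative bag of little bootstraps CI for the mean of `values`.

    Hypothetical helper, not part of any published package.
    `values` must be a list with one number per independent unit (e.g. subject).
    """
    rng = random.Random(seed)
    n = len(values)
    # Each "little" subset holds b = n**gamma distinct units, far fewer than n.
    b = max(2, int(n ** gamma))
    lowers, uppers = [], []
    for _ in range(n_subsets):
        # Draw the subset without replacement from the full data.
        subset = rng.sample(values, b)
        estimates = []
        for _ in range(n_boot):
            # Resample n units from the b subset units with replacement.
            # Only the b multinomial counts need to be stored, not n values.
            counts = [0] * b
            for idx in rng.choices(range(b), k=n):
                counts[idx] += 1
            # Weighted mean: the estimator evaluated on the size-n resample.
            estimates.append(sum(c * v for c, v in zip(counts, subset)) / n)
        estimates.sort()
        # Percentile interval within this subset.
        lowers.append(estimates[int((alpha / 2) * (n_boot - 1))])
        uppers.append(estimates[int((1 - alpha / 2) * (n_boot - 1))])
    # Average the interval endpoints across subsets.
    return statistics.mean(lowers), statistics.mean(uppers)


if __name__ == "__main__":
    gen = random.Random(1)
    subject_means = [gen.gauss(10.0, 2.0) for _ in range(2000)]
    print(blb_mean_ci(subject_means))
```

Each resample is represented by `b` counts rather than `n` data points, so the estimator only ever touches a small subset at a time. That property is what allows the subsets to be processed independently on separate workers.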
