
Analyzing large datasets with bootstrap penalization.

Authors

Fang Kuangnan, Ma Shuangge

Affiliations

Department of Statistics, Xiamen University, Xiamen, Fujian, China.

Department of Biostatistics, Yale University, New Haven, CT, 06520, USA.

Publication information

Biom J. 2017 Mar;59(2):358-376. doi: 10.1002/bimj.201600052. Epub 2016 Nov 21.

DOI:10.1002/bimj.201600052
PMID:27870109
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5577005/
Abstract

Data with a large p (number of covariates) and/or a large n (sample size) are now commonly encountered. For many problems, regularization especially penalization is adopted for estimation and variable selection. The straightforward application of penalization to large datasets demands a "big computer" with high computational power. To improve computational feasibility, we develop bootstrap penalization, which dissects a big penalized estimation into a set of small ones, which can be executed in a highly parallel manner and each only demands a "small computer". The proposed approach takes different strategies for data with different characteristics. For data with a large p but a small to moderate n, covariates are first clustered into relatively homogeneous blocks. The proposed approach consists of two sequential steps. In each step and for each bootstrap sample, we select blocks of covariates and run penalization. The results from multiple bootstrap samples are pooled to generate the final estimate. For data with a large n but a small to moderate p, we bootstrap a small number of subjects, apply penalized estimation, and then conduct a weighted average over multiple bootstrap samples. For data with a large p and a large n, the natural marriage of the previous two methods is applied. Numerical studies, including simulations and data analysis, show that the proposed approach has computational and numerical advantages over the straightforward application of penalization. An R package has been developed to implement the proposed methods.
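
The large-n, small-to-moderate-p strategy described in the abstract (bootstrap a small number of subjects, apply penalized estimation, then take a weighted average over bootstrap samples) can be illustrated with a minimal sketch. This is not the authors' R package or their exact algorithm. The function names, the use of a Lasso penalty fitted by coordinate descent, the absence of an intercept, and in particular the weighting rule (inverse out-of-sample mean squared error) are assumptions made here for illustration only; the abstract does not specify how the weights are chosen.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by the Lasso update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent (no intercept).

    Minimizes (1 / (2n)) * ||y - X b||^2 + lam * ||b||_1.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float)  # copy of y; equals y - X @ beta while beta == 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def bootstrap_penalized(X, y, m, B, lam, seed=0):
    """Hypothetical helper: average B small penalized fits.

    Each fit uses only m << n subjects, so it needs little memory,
    and the B fits are independent and could run in parallel.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    betas, weights = [], []
    for _ in range(B):
        idx = rng.choice(n, size=m, replace=True)  # small bootstrap sample
        beta = lasso_cd(X[idx], y[idx], lam)
        # ASSUMPTION: weight each fit by the inverse of its mean squared
        # error on subjects that were not drawn into this sample.
        oob = np.setdiff1d(np.arange(n), idx)
        mse = np.mean((y[oob] - X[oob] @ beta) ** 2)
        betas.append(beta)
        weights.append(1.0 / (mse + 1e-12))
    w = np.array(weights) / np.sum(weights)
    return w @ np.array(betas)  # weighted average of coefficient vectors

# Small synthetic demonstration (simulated data, not from the paper).
rng = np.random.default_rng(1)
true_beta = np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0])
X_demo = rng.standard_normal((5000, 6))
y_demo = X_demo @ true_beta + 0.5 * rng.standard_normal(5000)
print(bootstrap_penalized(X_demo, y_demo, m=200, B=20, lam=0.05))
```

In this sketch each fit touches only `m` rows, which is the sense in which each sub-problem needs only a "small computer". The loop over `B` is written sequentially for clarity; because the fits do not depend on each other, it could be distributed across workers.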

Similar articles

1
Analyzing large datasets with bootstrap penalization.
Biom J. 2017 Mar;59(2):358-376. doi: 10.1002/bimj.201600052. Epub 2016 Nov 21.
2
Part 1. Statistical Learning Methods for the Effects of Multiple Air Pollution Constituents.
Res Rep Health Eff Inst. 2015 Jun(183 Pt 1-2):5-50.
3
Integrative Analysis of Cancer Diagnosis Studies with Composite Penalization.
Scand Stat Theory Appl. 2014 Mar 1;41(1):87-103. doi: 10.1111/j.1467-9469.2012.00816.x.
4
Variable selection for semiparametric regression models with iterated penalization.
J Nonparametr Stat. 2012 Jun 1;24(2):283-298. doi: 10.1080/10485252.2012.661054. Epub 2012 Apr 30.
5
Integrating approximate single factor graphical models.
Stat Med. 2020 Jan 30;39(2):146-155. doi: 10.1002/sim.8408. Epub 2019 Nov 20.
6
Integrative sparse partial least squares.
Stat Med. 2021 Apr;40(9):2239-2256. doi: 10.1002/sim.8900. Epub 2021 Feb 8.
7
Analysis of small sample size studies using nonparametric bootstrap test with pooled resampling method.
Stat Med. 2017 Jun 30;36(14):2187-2205. doi: 10.1002/sim.7263. Epub 2017 Mar 9.
8
L1 penalized continuation ratio models for ordinal response prediction using high-dimensional datasets.
Stat Med. 2012 Jun 30;31(14):1464-74. doi: 10.1002/sim.4484. Epub 2012 Feb 23.
9
A Penalization Method for Estimating Heterogeneous Covariate Effects in Cancer Genomic Data.
Genes (Basel). 2022 Apr 15;13(4):702. doi: 10.3390/genes13040702.
10
Multiple imputation with sequential penalized regression.
Stat Methods Med Res. 2019 May;28(5):1311-1327. doi: 10.1177/0962280218755574. Epub 2018 Feb 16.

Cited by

1
Hierarchical Multi-Label Classification With Gene-Environment Interactions in Disease Modeling.
Stat Med. 2025 Feb 10;44(3-4):e10330. doi: 10.1002/sim.10330.
2
High-dimensional feature selection in competing risks modeling: A stable approach using a split-and-merge ensemble algorithm.
Biom J. 2023 Feb;65(2):e2100164. doi: 10.1002/bimj.202100164. Epub 2022 Aug 7.
3
A generic Transcriptomics Reporting Framework (TRF) for 'omics data processing and analysis.
Regul Toxicol Pharmacol. 2017 Dec;91 Suppl 1(Suppl 1):S36-S45. doi: 10.1016/j.yrtph.2017.11.001. Epub 2017 Nov 4.

References

1
Pitfalls of hypothesis tests and model selection on bootstrap samples: Causes and consequences in biometrical applications.
Biom J. 2016 May;58(3):447-73. doi: 10.1002/bimj.201400246. Epub 2015 Sep 15.
2
Challenges of Big Data Analysis.
Natl Sci Rev. 2014 Jun;1(2):293-314. doi: 10.1093/nsr/nwt032.
3
RANDOM LASSO.
Ann Appl Stat. 2011 Mar 1;5(1):468-485. doi: 10.1214/10-AOAS377.
4
Computational solutions to large-scale data management and analysis.
Nat Rev Genet. 2010 Sep;11(9):647-57. doi: 10.1038/nrg2857.
5
Discussion of "Sure Independence Screening for Ultra-High Dimensional Feature Space".
J R Stat Soc Series B Stat Methodol. 2008 Nov;70(5):903. doi: 10.1111/j.1467-9868.2008.00674.x.
6
Regulation of gene expression in the mammalian eye and its relevance to eye disease.
Proc Natl Acad Sci U S A. 2006 Sep 26;103(39):14429-34. doi: 10.1073/pnas.0602562103. Epub 2006 Sep 18.
7
Computational cluster validation in post-genomic data analysis.
Bioinformatics. 2005 Aug 1;21(15):3201-12. doi: 10.1093/bioinformatics/bti517. Epub 2005 May 24.