Adapting prediction error estimates for biased complexity selection in high-dimensional bootstrap samples.

Authors

Binder Harald, Schumacher Martin

Affiliation

Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg.

Publication information

Stat Appl Genet Mol Biol. 2008;7(1):Article12. doi: 10.2202/1544-6115.1346. Epub 2008 Mar 14.

DOI:10.2202/1544-6115.1346

PMID:18384265

Abstract

The bootstrap is a tool that allows for efficient evaluation of the prediction performance of statistical techniques without having to set aside data for validation. This is especially important for high-dimensional data, e.g., arising from microarrays, because there the number of observations is often limited. To avoid overoptimism, the statistical technique to be evaluated has to be applied to every bootstrap sample in the same manner as it would be used on new data. This includes the selection of complexity, e.g., the number of boosting steps for gradient boosting algorithms. Using the latter, we demonstrate in a simulation study that complexity selection in conventional bootstrap samples, drawn with replacement, is severely biased in many scenarios. This translates into a considerable bias of prediction error estimates, often underestimating the amount of information that can be extracted from high-dimensional data. Potential remedies for this complexity selection bias, such as using a fixed level of complexity or sampling without replacement, are investigated, and it is shown that the latter works well in many settings. We focus on high-dimensional binary response data, with bootstrap .632+ estimates of the Brier score for performance evaluation, and on censored time-to-event data, with .632+ prediction error curve estimates. The latter, with the modified bootstrap procedure, is then applied to an example with microarray data from patients with diffuse large B-cell lymphoma.
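
The two ingredients named in the abstract, resampling with or without replacement and the .632+ combination of apparent and out-of-bag Brier scores, can be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, the subsample fraction of 0.632 for sampling without replacement is an assumption not stated in the abstract, and the .632+ weighting follows the standard definition of the estimator (relative overfitting rate and no-information error).

```python
import random


def draw_indices(n, rng, replace=True, fraction=0.632):
    """Return (in_bag, out_of_bag) index lists for one resample.

    replace=True  : conventional bootstrap, n draws with replacement.
    replace=False : subsample without replacement; the size
                    round(fraction * n) is an assumption made here.
    """
    if replace:
        in_bag = [rng.randrange(n) for _ in range(n)]
    else:
        in_bag = rng.sample(range(n), max(1, round(fraction * n)))
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


def brier(y, p):
    """Mean squared difference between 0/1 outcomes and predicted probabilities."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)


def no_information_brier(y, p):
    """Brier score when every outcome is paired with every prediction."""
    return sum((yi - pj) ** 2 for yi in y for pj in p) / (len(y) * len(p))


def err_632plus(err_app, err_oob, gamma):
    """Combine apparent error, out-of-bag error and no-information error."""
    err_oob = min(err_oob, gamma)  # cap the out-of-bag error at gamma
    if err_oob > err_app and gamma > err_app:
        r = (err_oob - err_app) / (gamma - err_app)  # relative overfitting
    else:
        r = 0.0
    w = 0.632 / (1.0 - 0.368 * r)  # weight moves from 0.632 toward 1
    return (1.0 - w) * err_app + w * err_oob
```

In a full evaluation, the model would be refit on each in-bag sample, including the complexity selection step (for example, choosing the number of boosting steps), and scored on the corresponding out-of-bag observations; the average of those scores would then be passed to `err_632plus` as `err_oob`.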

Similar articles

Boosting for high-dimensional time-to-event data with competing risks.

Bioinformatics. 2009 Apr 1;25(7):890-6. doi: 10.1093/bioinformatics/btp088. Epub 2009 Feb 25.

A general, prediction error-based criterion for selecting model complexity for high-dimensional survival models.

Stat Med. 2010 Mar 30;29(7-8):830-8. doi: 10.1002/sim.3765.

Assessment of survival prediction models based on microarray data.

Bioinformatics. 2007 Jul 15;23(14):1768-74. doi: 10.1093/bioinformatics/btm232. Epub 2007 May 7.

Prediction error estimation: a comparison of resampling methods.

Bioinformatics. 2005 Aug 1;21(15):3301-7. doi: 10.1093/bioinformatics/bti499. Epub 2005 May 19.

A comparison of bootstrap methods and an adjusted bootstrap approach for estimating the prediction error in microarray classification.

Stat Med. 2007 Dec 20;26(29):5320-34. doi: 10.1002/sim.2968.

Boosting method for nonlinear transformation models with censored survival data.

Biostatistics. 2008 Oct;9(4):658-67. doi: 10.1093/biostatistics/kxn005. Epub 2008 Mar 15.

An evaluation of resampling methods for assessment of survival risk prediction in high-dimensional settings.

Stat Med. 2011 Mar 15;30(6):642-53. doi: 10.1002/sim.4106. Epub 2010 Dec 1.

An overview of techniques for linking high-dimensional molecular data to time-to-event endpoints by risk prediction models.

Biom J. 2011 Mar;53(2):170-89. doi: 10.1002/bimj.201000152. Epub 2011 Feb 17.

Assessment of evaluation criteria for survival prediction from genomic data.

Biom J. 2011 Mar;53(2):202-16. doi: 10.1002/bimj.201000048. Epub 2011 Feb 10.

Cited by

High-Dimensional Variable Selection With Competing Events Using Cooperative Penalized Regression.

Biom J. 2025 Feb;67(1):e70036. doi: 10.1002/bimj.70036.

Tutorial on survival modeling with applications to omics data.

Bioinformatics. 2024 Mar 4;40(3). doi: 10.1093/bioinformatics/btae132.

Selection of variables for multivariable models: Opportunities and limitations in quantifying model stability by resampling.

Stat Med. 2021 Jan 30;40(2):369-381. doi: 10.1002/sim.8779. Epub 2020 Oct 21.

Inferring transportation mode from smartphone sensors: Evaluating the potential of Wi-Fi and Bluetooth.

PLoS One. 2020 Jul 2;15(7):e0234003. doi: 10.1371/journal.pone.0234003. eCollection 2020.

A multivariable approach for risk markers from pooled molecular data with only partial overlap.

BMC Med Genet. 2019 Jul 19;20(1):128. doi: 10.1186/s12881-019-0849-0.

Integrating multiple molecular sources into a clinical risk prediction signature by extracting complementary information.

BMC Bioinformatics. 2016 Aug 30;17(1):327. doi: 10.1186/s12859-016-1183-6.

Dealing with prognostic signature instability: a strategy illustrated for cardiovascular events in patients with end-stage renal disease.

BMC Med Genomics. 2016 Jul 20;9(1):43. doi: 10.1186/s12920-016-0210-9.

Identifying Prognostic SNPs in Clinical Cohorts: Complementing Univariate Analyses by Resampling and Multivariable Modeling.

PLoS One. 2016 May 9;11(5):e0155226. doi: 10.1371/journal.pone.0155226. eCollection 2016.

Gene promoter methylation signature predicts survival of head and neck squamous cell carcinoma patients.

Epigenetics. 2016;11(1):61-73. doi: 10.1080/15592294.2015.1137414. Epub 2016 Jan 19.

A Metabolome-Wide Association Study of Kidney Function and Disease in the General Population.

J Am Soc Nephrol. 2016 Apr;27(4):1175-88. doi: 10.1681/ASN.2014111099. Epub 2015 Oct 8.
