Adapting prediction error estimates for biased complexity selection in high-dimensional bootstrap samples.

Author information

Binder Harald, Schumacher Martin

Affiliation

Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg.

Publication information

Stat Appl Genet Mol Biol. 2008;7(1):Article12. doi: 10.2202/1544-6115.1346. Epub 2008 Mar 14.

Abstract

The bootstrap is a tool that allows for efficient evaluation of the prediction performance of statistical techniques without having to set aside data for validation. This is especially important for high-dimensional data, e.g., arising from microarrays, because the number of observations there is often limited. To avoid overoptimism, the statistical technique to be evaluated has to be applied to every bootstrap sample in the same manner it would be used on new data. This includes a selection of complexity, e.g., the number of boosting steps for gradient boosting algorithms. Using the latter, we demonstrate in a simulation study that complexity selection in conventional bootstrap samples, drawn with replacement, is severely biased in many scenarios. This translates into a considerable bias of prediction error estimates, often underestimating the amount of information that can be extracted from high-dimensional data. Potential remedies for this complexity selection bias, such as using a fixed level of complexity or sampling without replacement, are investigated, and sampling without replacement is shown to work well in many settings. We focus on high-dimensional binary response data, with bootstrap .632+ estimates of the Brier score for performance evaluation, and on censored time-to-event data, with .632+ prediction error curve estimates. The latter, with the modified bootstrap procedure, is then applied to an example with microarray data from patients with diffuse large B-cell lymphoma.
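The abstract gives no formulas, so the following is only a minimal sketch of two ingredients it mentions: drawing resamples with or without replacement, and combining error estimates into a .632+ estimate of the Brier score. The weighting follows the standard .632+ definition (apparent error, out-of-sample error, no-information error). The function names, the subsample fraction of 0.632 for sampling without replacement, and the pairing-based no-information error are assumptions made for illustration, not details taken from the article. Model fitting and complexity selection, which would have to be repeated inside every resample, are left out.

```python
import random


def draw_indices(n, replace, rng, fraction=0.632):
    """Return (in_sample, out_of_sample) index lists for one resample."""
    if replace:
        # Conventional bootstrap: n draws with replacement.
        in_sample = [rng.randrange(n) for _ in range(n)]
    else:
        # Subsampling: round(fraction * n) distinct observations
        # (the fraction is an assumption of this sketch).
        in_sample = rng.sample(range(n), round(fraction * n))
    chosen = set(in_sample)
    out_of_sample = [i for i in range(n) if i not in chosen]
    return in_sample, out_of_sample


def brier(y, p):
    """Mean squared difference between 0/1 responses and predicted probabilities."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)


def no_information_brier(y, p):
    """Brier score averaged over all n * n pairings of responses and predictions."""
    n = len(y)
    return sum((yi - pj) ** 2 for yi in y for pj in p) / (n * n)


def err632plus(apparent, out_of_sample, no_information):
    """Combine the three error estimates with the standard .632+ weights."""
    # The out-of-sample error is capped at the no-information error.
    oos = min(out_of_sample, no_information)
    if oos > apparent and no_information > apparent:
        # Relative overfitting rate, between 0 and 1.
        r = (oos - apparent) / (no_information - apparent)
    else:
        r = 0.0
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * apparent + w * oos
```

In a full evaluation, `out_of_sample` would be the Brier score averaged over the observations left out of each resample, with the model (including its complexity selection) refitted on every `in_sample` set.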
