Binder Harald, Schumacher Martin
Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg.
Stat Appl Genet Mol Biol. 2008;7(1):Article12. doi: 10.2202/1544-6115.1346. Epub 2008 Mar 14.
The bootstrap is a tool that allows for efficient evaluation of the prediction performance of statistical techniques without having to set aside data for validation. This is especially important for high-dimensional data, e.g., arising from microarrays, because there the number of observations is often limited. To avoid overoptimism, the statistical technique to be evaluated has to be applied to every bootstrap sample in the same manner as it would be used on new data. This includes the selection of complexity, e.g., the number of boosting steps for gradient boosting algorithms. Using gradient boosting, we demonstrate in a simulation study that complexity selection in conventional bootstrap samples, drawn with replacement, is severely biased in many scenarios. This translates into a considerable bias of prediction error estimates, often underestimating the amount of information that can be extracted from high-dimensional data. Two potential remedies for this complexity selection bias are investigated: using a fixed level of complexity, and sampling without replacement. It is shown that sampling without replacement works well in many settings. We focus on high-dimensional binary response data, with bootstrap .632+ estimates of the Brier score for performance evaluation, and on censored time-to-event data, with .632+ prediction error curve estimates. The .632+ prediction error curve estimate, computed with the modified bootstrap procedure, is then applied to an example with microarray data from patients with diffuse large B-cell lymphoma.
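The resampling scheme described in the abstract can be illustrated with a minimal Python sketch for the binary-response case. It is not the authors' implementation. All function names are hypothetical, the subsample fraction of 0.632 for sampling without replacement is an assumption, and the weighting follows the standard .632+ formula of Efron and Tibshirani. The caller-supplied `fit` function is assumed to perform the complete model-building procedure, including any complexity selection, and to return a function that maps one covariate vector to a predicted probability.

```python
import random


def resample_indices(n, rng, with_replacement=True, fraction=0.632):
    """Return (in-bag, out-of-bag) index lists for one resample.

    with_replacement=True  -> conventional bootstrap sample of size n.
    with_replacement=False -> subsample of size round(fraction * n)
                              (the fraction is an assumption of this sketch).
    """
    if with_replacement:
        inbag = [rng.randrange(n) for _ in range(n)]
    else:
        inbag = rng.sample(range(n), round(fraction * n))
    chosen = set(inbag)
    oob = [i for i in range(n) if i not in chosen]
    return inbag, oob


def brier(y, p):
    """Mean squared difference between 0/1 outcomes and predicted probabilities."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)


def no_information_brier(y, p):
    """Brier score when outcomes and predictions are paired in all possible ways."""
    return sum((yi - pj) ** 2 for yi in y for pj in p) / (len(y) * len(p))


def combine_632plus(apparent, oob_error, no_info):
    """Weight apparent and out-of-bag error by the relative overfitting rate."""
    oob_capped = min(oob_error, no_info)
    if oob_capped > apparent and no_info > apparent:
        r = (oob_capped - apparent) / (no_info - apparent)
    else:
        r = 0.0
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * apparent + w * oob_capped


def estimate_632plus(x, y, fit, n_resamples=100, with_replacement=True, seed=0):
    """.632+ estimate of the Brier score for the procedure implemented by fit."""
    rng = random.Random(seed)
    n = len(y)

    # Apparent error: fit on all data and evaluate on the same data.
    full_model = fit(x, y)
    p_full = [full_model(xi) for xi in x]
    apparent = brier(y, p_full)
    no_info = no_information_brier(y, p_full)

    # Out-of-bag error: refit (including complexity selection inside fit)
    # on each resample and evaluate only on observations left out of it.
    loss_sum = [0.0] * n
    loss_cnt = [0] * n
    for _ in range(n_resamples):
        inbag, oob = resample_indices(n, rng, with_replacement)
        model = fit([x[i] for i in inbag], [y[i] for i in inbag])
        for i in oob:
            loss_sum[i] += (y[i] - model(x[i])) ** 2
            loss_cnt[i] += 1

    # Average per observation first, then across observations.
    per_obs = [s / c for s, c in zip(loss_sum, loss_cnt) if c > 0]
    oob_error = sum(per_obs) / len(per_obs)
    return combine_632plus(apparent, oob_error, no_info)
```

Calling `estimate_632plus` with `with_replacement=False` switches from conventional bootstrap samples to subsamples drawn without replacement, so the two schemes can be compared for the same `fit` procedure. Because complexity selection happens inside `fit`, it is repeated in every resample, as the abstract requires.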