Petersen Maya L, Molinaro Annette M, Sinisi Sandra E, van der Laan Mark J
Division of Biostatistics, University of California, Berkeley, School of Public Health, Earl Warren Hall 7360 Berkeley, California 94720-7360, phone: 510.642.3241 fax: 510.643.5163.
J Multivar Anal. 2008 Mar;25(2):260-266. doi: 10.1016/j.jmva.2007.07.004.
Many applications aim to learn a high-dimensional parameter of a data-generating distribution from a sample of independent and identically distributed observations. For example, the goal might be to estimate the conditional mean of an outcome given a list of input variables. In this prediction context, bootstrap aggregating (bagging) has been introduced as a method to reduce the variance of a given estimator at little cost in bias. Bagging involves applying an estimator to multiple bootstrap samples and averaging the results across bootstrap samples. To address the curse of dimensionality, a common practice has been to apply bagging to estimators that themselves use cross-validation, so that cross-validation within a bootstrap sample selects the fine-tuning parameters that trade off bias and variance of the bootstrap-sample-specific candidate estimators. In this article we point out that, to achieve the correct bias-variance trade-off for the parameter of interest, one should instead apply the cross-validation selector externally to candidate bagged estimators indexed by these fine-tuning parameters. We use three simulations to compare the new cross-validated bagging method with bagging of cross-validated estimators and bagging of non-cross-validated estimators.
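The two ways of combining bagging with cross-validation that the abstract contrasts can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the polynomial-regression candidate estimators, the degree as fine-tuning parameter, the squared-error loss, the simulated data, and every function name (`fit_predict_poly`, `bagged_predict`, `cross_validated_bagging`, `bagging_of_cv_estimators`) are assumptions made for the example.

```python
import numpy as np

def fit_predict_poly(x_tr, y_tr, x_te, degree):
    """Hypothetical candidate estimator: least-squares polynomial of a given degree."""
    return np.polyval(np.polyfit(x_tr, y_tr, degree), x_te)

def bagged_predict(x_tr, y_tr, x_te, degree, n_boot, rng):
    """Bagging: fit the candidate on n_boot bootstrap samples and average predictions."""
    n = len(x_tr)
    preds = np.empty((n_boot, len(x_te)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample n indices with replacement
        preds[b] = fit_predict_poly(x_tr[idx], y_tr[idx], x_te, degree)
    return preds.mean(axis=0)

def cv_risk(predict_fn, x, y, folds):
    """V-fold cross-validated mean squared error of predict_fn(x_tr, y_tr, x_val)."""
    total = 0.0
    for val in folds:
        train = np.setdiff1d(np.arange(len(x)), val)
        total += np.sum((y[val] - predict_fn(x[train], y[train], x[val])) ** 2)
    return total / len(x)

def cross_validated_bagging(x, y, x_te, degrees, n_boot, n_folds, rng):
    """External selection: cross-validate the *bagged* candidates, then bag the winner."""
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    risks = [
        cv_risk(lambda a, b, c, d=d: bagged_predict(a, b, c, d, n_boot, rng), x, y, folds)
        for d in degrees
    ]
    best = degrees[int(np.argmin(risks))]
    return best, bagged_predict(x, y, x_te, best, n_boot, rng)

def bagging_of_cv_estimators(x, y, x_te, degrees, n_boot, n_folds, rng):
    """Internal selection: choose the degree by CV inside each bootstrap sample, then average."""
    n = len(x)
    preds = np.empty((n_boot, len(x_te)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        xb, yb = x[idx], y[idx]
        folds = np.array_split(rng.permutation(n), n_folds)
        risks = [
            cv_risk(lambda a, b_, c, d=d: fit_predict_poly(a, b_, c, d), xb, yb, folds)
            for d in degrees
        ]
        best = degrees[int(np.argmin(risks))]
        preds[b] = fit_predict_poly(xb, yb, x_te, best)
    return preds.mean(axis=0)

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=60)
y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=60)
x_new = np.linspace(-0.9, 0.9, 5)
degrees = [1, 2, 3, 4, 5]

best_degree, pred_cvb = cross_validated_bagging(x, y, x_new, degrees, 25, 5, rng)
pred_bcv = bagging_of_cv_estimators(x, y, x_new, degrees, 25, 5, rng)
print("degree selected for the bagged estimator:", best_degree)
```

In the first function the cross-validation selector evaluates the risk of each bagged candidate directly, so the chosen degree reflects the bias-variance trade-off after averaging. In the second, each bootstrap sample picks its own degree for an unbagged fit, which is the common practice the abstract argues against.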