Ounpraseuth Songthip, Lensing Shelly Y, Spencer Horace J, Kodell Ralph L
Department of Biostatistics, University of Arkansas for Medical Sciences, 4301 W. Markham St. Slot 781, Little Rock, AR 72205, USA.
BMC Res Notes. 2012 Nov 28;5:656. doi: 10.1186/1756-0500-5-656.
To estimate a classifier's error in predicting future observations, bootstrap methods have been proposed as reduced-variation alternatives to traditional cross-validation (CV) methods based on sampling without replacement. Monte Carlo (MC) simulation studies aimed at estimating the true misclassification error conditional on the training set are commonly used to compare CV methods. We conducted an MC simulation study to compare a new method of bootstrap CV (BCV) to k-fold CV for estimating classification error.
For the low-dimensional conditions simulated, the modest positive bias of k-fold CV contrasted sharply with the substantial negative bias of the new BCV method. This behavior was corroborated using a real-world dataset of prognostic gene-expression profiles in breast cancer patients. Our simulation results demonstrate some extreme characteristics of variance and bias that can occur due to a fault in the design of CV exercises aimed at estimating the true conditional error of a classifier, and that appear not to have been fully appreciated in previous studies. Although CV is a sound practice for estimating a classifier's generalization error, using CV to estimate the fixed misclassification error of a trained classifier conditional on the training set is problematic. While MC simulation of this estimation exercise can correctly represent the average bias of a classifier, it will overstate the between-run variance of the bias.
We recommend k-fold CV over the new BCV method for estimating a classifier's generalization error. The extreme negative bias of BCV is too high a price to pay for its reduced variance.
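To make the two kinds of estimator concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not describe how the new BCV method is constructed, so the bootstrap variant here (resample the data with replacement, then cross-validate inside each resample) is an assumption, as are the nearest-centroid classifier, the simulated data, and all function names. Under that assumption, copies of one observation can land in both the training and the test folds, which is one plausible route to an optimistic (negatively biased) error estimate.

```python
import random

def fit_centroids(X, y):
    """Nearest-centroid classifier (illustrative choice): per-class mean vectors."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict(centroids, row):
    """Assign the class whose centroid is nearest in squared Euclidean distance."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))

def kfold_error(X, y, k, rng):
    """k-fold CV error; assumes 2 <= k <= len(X).
    Each row is tested once by a model fitted without that row."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds, sampled without replacement
    wrong = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        wrong += sum(predict(model, X[i]) != y[i] for i in test)
    return wrong / len(X)

def bootstrap_cv_error(X, y, k, n_boot, rng):
    """ASSUMED BCV design: average the k-fold CV error over bootstrap resamples.
    Duplicated rows may appear in both the training and the test folds."""
    n = len(X)
    estimates = []
    for _ in range(n_boot):
        draw = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        estimates.append(kfold_error([X[i] for i in draw], [y[i] for i in draw], k, rng))
    return sum(estimates) / n_boot

if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical low-dimensional data: two overlapping Gaussian classes in 2-D.
    X = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(20)]
    X += [[rng.gauss(1.0, 1.0), rng.gauss(1.0, 1.0)] for _ in range(20)]
    y = [0] * 20 + [1] * 20
    print("k-fold CV estimate:  ", kfold_error(X, y, 10, rng))
    print("bootstrap CV estimate:", bootstrap_cv_error(X, y, 10, 50, rng))
```

Repeating both estimators over many independently simulated training sets, and comparing each estimate with the error of the fitted classifier on a large independent test sample, would give the kind of bias and variance comparison the abstract describes.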