Jiang Wenyu, Simon Richard
Biometric Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, 6130 Executive Boulevard, Rockville, MD 20852, USA.
Stat Med. 2007 Dec 20;26(29):5320-34. doi: 10.1002/sim.2968.
This paper first provides a critical review of some existing methods for estimating the prediction error in classifying microarray data, where the number of genes greatly exceeds the number of specimens. Special attention is given to the bootstrap-related methods. When the sample size n is small, we find that all the reviewed methods suffer from either substantial bias or variability. We introduce a repeated leave-one-out bootstrap (RLOOB) method that predicts for each specimen in the sample using bootstrap learning sets of size ln, a multiple l of the sample size n. We then propose an adjusted bootstrap (ABS) method that fits a learning curve to the RLOOB estimates calculated with different bootstrap learning set sizes. The ABS method is robust across the situations we investigate and provides a slightly conservative estimate of the prediction error. Even with small samples, it does not suffer from the large upward bias of the leave-one-out bootstrap and the 0.632+ bootstrap, nor from the large variability of leave-one-out cross-validation in microarray applications.
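The abstract describes the two estimators only at a high level, so the following is a minimal Python sketch of how they could be organized, not the authors' implementation. The classifier choice (3-nearest-neighbor), the number of bootstrap replicates, the set of size multipliers, and the inverse-power-law form of the learning curve are all illustrative assumptions; the function names rloob_error and abs_error are hypothetical.

```python
# Hypothetical sketch of the RLOOB estimator and the ABS learning-curve
# adjustment. Classifier, replicate counts, multipliers, and the curve
# form e(m) = a + b * m**(-c) are assumptions, not the paper's choices.
import numpy as np
from scipy.optimize import curve_fit
from sklearn.neighbors import KNeighborsClassifier

def rloob_error(X, y, learning_set_size, n_boot=50, seed=0):
    """Repeated leave-one-out bootstrap: for each specimen i, train on
    bootstrap learning sets of the given size drawn (with replacement)
    from the other n-1 specimens, and record how often specimen i is
    misclassified."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = []
    for i in range(n):
        others = np.delete(np.arange(n), i)   # leave specimen i out
        wrong = 0
        for _ in range(n_boot):
            idx = rng.choice(others, size=learning_set_size, replace=True)
            clf = KNeighborsClassifier(n_neighbors=3).fit(X[idx], y[idx])
            wrong += clf.predict(X[i:i + 1])[0] != y[i]
        errors.append(wrong / n_boot)
    return float(np.mean(errors))

def abs_error(X, y, multipliers=(1.0, 1.5, 2.0, 2.5), n_boot=50):
    """Adjusted bootstrap: compute RLOOB estimates at several learning
    set sizes l*n, fit a learning curve through them, and read off the
    fitted error at the actual sample size n."""
    n = len(y)
    sizes = np.array([int(l * n) for l in multipliers])
    errs = np.array([rloob_error(X, y, m, n_boot=n_boot) for m in sizes])
    # Assumed inverse-power-law learning curve e(m) = a + b * m**(-c).
    f = lambda m, a, b, c: a + b * m ** (-c)
    (a, b, c), _ = curve_fit(f, sizes, errs,
                             p0=(errs.min(), 1.0, 0.5), maxfev=10000)
    return float(f(n, a, b, c))
```

Under these assumptions, abs_error(X, y) would return the learning-curve-adjusted estimate of the prediction error at sample size n; in practice the number of bootstrap replicates and the grid of size multipliers would be tuned to the application.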