Bias in error estimation when using cross-validation for model selection.

Author Information

Varma Sudhir, Simon Richard

Affiliation

Biometric Research Branch, National Cancer Institute, Bethesda, MD, USA.

Publication Information

BMC Bioinformatics. 2006 Feb 23;7:91. doi: 10.1186/1471-2105-7-91.

Abstract

BACKGROUND

Cross-validation (CV) is an effective method for estimating the prediction error of a classifier. Some recent articles have proposed methods for optimizing classifiers by choosing classifier parameter values that minimize the CV error estimate. We have evaluated the validity of using the CV error estimate of the optimized classifier as an estimate of the true error expected on independent data.
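
The tuning procedure under evaluation can be made concrete with a short sketch. The following Python code is ours, written with scikit-learn rather than taken from the paper; the sample sizes and the grid of C values are illustrative assumptions. It tunes a linear SVM on a "null" dataset of the kind described in the Results by minimizing the LOOCV error estimate, then reports that minimized estimate. Since the true error on null data is 50% by construction, reporting the minimum of several noisy CV estimates tends to be optimistically biased, which is the effect the paper quantifies:

import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# "Null" data: features drawn identically for both classes,
# so the true error of any classifier is 50%.
X = rng.standard_normal((40, 1000))
y = np.repeat([0, 1], 20)

# Tune C by minimizing the LOOCV error estimate (the step under study).
# The grid of C values is an illustrative assumption.
errors = {}
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    acc = cross_val_score(SVC(kernel="linear", C=C), X, y,
                          cv=LeaveOneOut()).mean()
    errors[C] = 1.0 - acc

best_C = min(errors, key=errors.get)
# Reporting errors[best_C] as if it were the true error is the
# potentially biased use of CV examined in this paper.
print(f"best C = {best_C}, minimized LOOCV error = {errors[best_C]:.2f}")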

RESULTS

We used CV to optimize the classification parameters for two kinds of classifiers: Shrunken Centroids and Support Vector Machines (SVM). Random training datasets were created with no difference in the distribution of the features between the two classes. Using these "null" datasets, we selected classifier parameter values that minimized the CV error estimate. 10-fold CV was used for Shrunken Centroids, while Leave-One-Out CV (LOOCV) was used for the SVM. Independent test data were created to estimate the true error. Using both "null" and "non-null" datasets (the latter with differential expression between the classes), we also tested a nested CV procedure, in which an inner CV loop performs the parameter tuning while an outer CV loop computes the error estimate. The CV error estimate for the classifier with the optimal parameters was found to be a substantially biased estimate of the true error that the classifier would incur on independent data. Even though there is no real difference between the two classes in the "null" datasets, the CV error estimate for the Shrunken Centroids classifier with the optimal parameters was less than 30% on 18.5% of the simulated training datasets. For the SVM with optimal parameters, the estimated error rate was less than 30% on 38% of the "null" datasets. Performance of the optimized classifiers on the independent test set was no better than chance. The nested CV procedure reduces the bias considerably and, for both Shrunken Centroids and SVM classifiers under both "null" and "non-null" data distributions, gives an error estimate very close to that obtained on the independent test set.
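
A minimal sketch of this nested CV procedure, again using scikit-learn rather than the authors' implementation: GridSearchCV carries out the inner tuning loop on each training fold, and cross_val_score wraps it as the outer loop, so parameter selection is repeated inside every outer fold. scikit-learn's NearestCentroid with a shrink_threshold plays the role of the Shrunken Centroids classifier; the shrinkage grid, fold counts, and data sizes are assumptions for illustration.

import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(0)

# "Null" data again: no real class difference, so the true error is 50%.
X = rng.standard_normal((40, 1000))
y = np.repeat([0, 1], 20)

# Inner loop: choose the shrinkage threshold of a nearest-shrunken-
# centroid classifier by 10-fold CV on the training portion only.
inner = GridSearchCV(
    NearestCentroid(),
    param_grid={"shrink_threshold": [None, 0.5, 1.0, 2.0, 4.0]},
    cv=10,
)

# Outer loop: estimate the error of the complete procedure, with the
# tuning repeated inside every outer training fold.
outer_acc = cross_val_score(inner, X, y, cv=10)
print(f"nested CV error estimate = {1.0 - outer_acc.mean():.2f}")

Because the outer loop never sees its held-out fold during tuning, the printed estimate on null data should come out near the true 50%, in contrast to the minimized inner-loop estimate above.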

CONCLUSION

We show that using CV to compute an error estimate for a classifier that has itself been tuned using CV gives a significantly biased estimate of the true error. Proper use of CV for estimating true error of a classifier developed using a well defined algorithm requires that all steps of the algorithm, including classifier parameter tuning, be repeated in each CV loop. A nested CV procedure provides an almost unbiased estimate of the true error.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ca2/1397873/478bd55b43fc/1471-2105-7-91-1.jpg
