Bias in error estimation when using cross-validation for model selection.

Author Information

Varma Sudhir, Simon Richard

Affiliation

Biometric Research Branch, National Cancer Institute, Bethesda, MD, USA.

Publication Information

BMC Bioinformatics. 2006 Feb 23;7:91. doi: 10.1186/1471-2105-7-91.

Abstract

BACKGROUND

Cross-validation (CV) is an effective method for estimating the prediction error of a classifier. Some recent articles have proposed methods for optimizing classifiers by choosing classifier parameter values that minimize the CV error estimate. We have evaluated the validity of using the CV error estimate of the optimized classifier as an estimate of the true error expected on independent data.
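
The tuning procedure under evaluation can be made concrete with a short sketch. The following Python code is ours, written with scikit-learn rather than taken from the paper; the sample sizes and the grid of C values are illustrative assumptions. It tunes a linear SVM on a "null" dataset of the kind described in the Results by minimizing the LOOCV error estimate, then reports that minimized estimate. Since the true error on null data is 50% by construction, reporting the minimum of several noisy CV estimates tends to be optimistically biased, which is the effect the paper quantifies:

import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# "Null" data: features drawn identically for both classes,
# so the true error of any classifier is 50%.
X = rng.standard_normal((40, 1000))
y = np.repeat([0, 1], 20)

# Tune C by minimizing the LOOCV error estimate (the step under study).
# The grid of C values is an illustrative assumption.
errors = {}
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    acc = cross_val_score(SVC(kernel="linear", C=C), X, y,
                          cv=LeaveOneOut()).mean()
    errors[C] = 1.0 - acc

best_C = min(errors, key=errors.get)
# Reporting errors[best_C] as if it were the true error is the
# potentially biased use of CV examined in this paper.
print(f"best C = {best_C}, minimized LOOCV error = {errors[best_C]:.2f}")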

RESULTS

We used CV to optimize the classification parameters for two kinds of classifiers: Shrunken Centroids and Support Vector Machines (SVM). Random training datasets were created with no difference in the distribution of the features between the two classes. Using these "null" datasets, we selected classifier parameter values that minimized the CV error estimate. 10-fold CV was used for Shrunken Centroids, while Leave-One-Out CV (LOOCV) was used for the SVM. Independent test data were created to estimate the true error. Using both "null" and "non-null" datasets (the latter with differential expression between the classes), we also tested a nested CV procedure, in which an inner CV loop performs the parameter tuning while an outer CV loop computes the error estimate. The CV error estimate for the classifier with the optimal parameters was found to be a substantially biased estimate of the true error that the classifier would incur on independent data. Even though there is no real difference between the two classes in the "null" datasets, the CV error estimate for the Shrunken Centroids classifier with the optimal parameters was less than 30% on 18.5% of the simulated training datasets. For the SVM with optimal parameters, the estimated error rate was less than 30% on 38% of the "null" datasets. Performance of the optimized classifiers on the independent test set was no better than chance. The nested CV procedure reduces the bias considerably and, for both Shrunken Centroids and SVM classifiers under both "null" and "non-null" data distributions, gives an error estimate very close to that obtained on the independent test set.
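
A minimal sketch of this nested CV procedure, again using scikit-learn rather than the authors' implementation: GridSearchCV carries out the inner tuning loop on each training fold, and cross_val_score wraps it as the outer loop, so parameter selection is repeated inside every outer fold. scikit-learn's NearestCentroid with a shrink_threshold plays the role of the Shrunken Centroids classifier; the shrinkage grid, fold counts, and data sizes are assumptions for illustration.

import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(0)

# "Null" data again: no real class difference, so the true error is 50%.
X = rng.standard_normal((40, 1000))
y = np.repeat([0, 1], 20)

# Inner loop: choose the shrinkage threshold of a nearest-shrunken-
# centroid classifier by 10-fold CV on the training portion only.
inner = GridSearchCV(
    NearestCentroid(),
    param_grid={"shrink_threshold": [None, 0.5, 1.0, 2.0, 4.0]},
    cv=10,
)

# Outer loop: estimate the error of the complete procedure, with the
# tuning repeated inside every outer training fold.
outer_acc = cross_val_score(inner, X, y, cv=10)
print(f"nested CV error estimate = {1.0 - outer_acc.mean():.2f}")

Because the outer loop never sees its held-out fold during tuning, the printed estimate on null data should come out near the true 50%, in contrast to the minimized inner-loop estimate above.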

CONCLUSION

We show that using CV to compute an error estimate for a classifier that has itself been tuned using CV gives a significantly biased estimate of the true error. Proper use of CV for estimating true error of a classifier developed using a well defined algorithm requires that all steps of the algorithm, including classifier parameter tuning, be repeated in each CV loop. A nested CV procedure provides an almost unbiased estimate of the true error.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ca2/1397873/478bd55b43fc/1471-2105-7-91-1.jpg
