Tsamardinos Ioannis, Greasidou Elissavet, Borboudakis Giorgos
Computer Science Department, University of Crete and Gnosis Data Analysis PC, Heraklion, Greece.
Mach Learn. 2018;107(12):1895-1922. doi: 10.1007/s10994-018-5714-4. Epub 2018 May 9.
Cross-Validation (CV), and out-of-sample performance-estimation protocols in general, are often employed both for (a) selecting the optimal combination of algorithms and hyper-parameter values (called a configuration) for producing the final predictive model, and (b) estimating the predictive performance of the final model. However, the cross-validated performance of the best configuration is optimistically biased. We present an efficient bootstrap method that corrects for this bias, called Bootstrap Bias Corrected CV (BBC-CV). The main idea of BBC-CV is to bootstrap the whole process of selecting the best-performing configuration on the pooled out-of-sample predictions of each configuration, without any additional training of models. Compared to the alternatives, namely nested cross-validation (Varma and Simon in BMC Bioinform 7(1):91, 2006) and the method by Tibshirani and Tibshirani (Ann Appl Stat 822-829, 2009), BBC-CV is computationally more efficient, has smaller variance and bias, and is applicable to any performance metric (accuracy, AUC, concordance index, mean squared error). We then reuse the idea of bootstrapping the out-of-sample predictions to speed up the CV process itself: using a bootstrap-based statistical criterion, we stop training models on new folds for configurations that are, with high probability, inferior. We name this method Bootstrap Bias Corrected with Dropping CV (BBCD-CV); it is both efficient and provides accurate performance estimates.
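The BBC-CV idea described above can be sketched in a few lines: after one ordinary CV run has produced a matrix of pooled out-of-sample predictions (one column per configuration), the configuration-selection step is bootstrapped over the rows of that matrix, and the selected configuration is scored on the out-of-bag rows. This is an illustrative reconstruction from the abstract, not the authors' code; the function and parameter names are my own.

```python
import numpy as np

def bbc_cv_estimate(oos_preds, y, metric, B=500, seed=None):
    """Hedged sketch of Bootstrap Bias Corrected CV (BBC-CV).

    oos_preds : (n_samples, n_configs) pooled out-of-sample predictions
                from a single CV run, one column per configuration.
    y         : (n_samples,) true targets.
    metric    : callable(y_true, y_pred) -> score, higher is better.
    Returns a bias-corrected estimate of the performance of the whole
    "select the best configuration" process, with no extra model training.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for _ in range(B):
        boot = rng.integers(0, n, size=n)           # bootstrap sample of row indices
        oob = np.setdiff1d(np.arange(n), boot)      # out-of-bag rows
        if oob.size == 0:
            continue
        # Select the best configuration on the bootstrap rows ...
        in_scores = [metric(y[boot], oos_preds[boot, j])
                     for j in range(oos_preds.shape[1])]
        best = int(np.argmax(in_scores))
        # ... and score that configuration on the held-out rows.
        scores.append(metric(y[oob], oos_preds[oob, best]))
    return float(np.mean(scores))
```

When every configuration is pure noise, the naive "CV score of the winner" is optimistic, while the bootstrap estimate stays near chance level. The dropping criterion of BBCD-CV would operate on the same prediction matrix: after each fold, a configuration whose bootstrap-estimated probability of overtaking the current best is negligible is no longer trained on the remaining folds.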