
Distributional bias compromises leave-one-out cross-validation.

Author Information

Austin George I, Pe'er Itsik, Korem Tal

Publication Information

ArXiv. 2025 Mar 24:arXiv:2406.01652v2.

Abstract

Cross-validation is a common method for estimating the predictive performance of machine learning models. In a data-scarce regime, where one typically wishes to maximize the number of instances used for training the model, an approach called "leave-one-out cross-validation" is often used. In this design, a separate model is built for predicting each data instance after training on all other instances. Since this results in a single test instance available per model trained, predictions are aggregated across the entire dataset to calculate common performance metrics such as the area under the receiver operating characteristic curve or R² scores. In this work, we demonstrate that this approach creates a negative correlation between the average label of each training fold and the label of its corresponding test instance, a phenomenon that we term distributional bias. As machine learning models tend to regress to the mean of their training data, this distributional bias tends to negatively impact performance evaluation and hyperparameter optimization. We show that this effect generalizes to leave-P-out cross-validation and persists across a wide range of modeling and evaluation approaches, and that it can lead to a bias against stronger regularization. To address this, we propose a generalizable rebalanced cross-validation approach that corrects for distributional bias for both classification and regression. We demonstrate that our approach improves cross-validation performance evaluation in synthetic simulations, across machine learning benchmarks, and in several published leave-one-out analyses.
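As a concrete illustration of the mechanism the abstract describes, the following is a minimal sketch (assuming only NumPy; it is not the authors' code). Under leave-one-out cross-validation, the mean label of each training fold, (sum(y) - y_i) / (n - 1), is a decreasing affine function of its held-out label y_i, so the two are perfectly anti-correlated. A baseline that simply predicts each fold's training mean, which should score an R² of about zero, is therefore pooled into a strictly negative R²:

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.normal(size=50)   # arbitrary labels; any distribution works
    n = len(y)

    # Mean label of each LOO training fold: (sum(y) - y_i) / (n - 1).
    # This is a decreasing affine function of the held-out label y_i.
    train_means = (y.sum() - y) / (n - 1)
    print(np.corrcoef(y, train_means)[0, 1])   # exactly -1: distributional bias

    # An uninformative baseline predicting the training-fold mean should score
    # R^2 ~= 0, but pooled LOO predictions give R^2 = 1 - (n/(n-1))^2 < 0.
    preds = train_means
    r2 = 1.0 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)
    print(r2)   # about -0.041 for n = 50

Models that regress toward their training mean inherit a milder version of this penalty, which is the effect the paper's rebalanced cross-validation corrects.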


Similar Articles

1. Tournament leave-pair-out cross-validation for receiver operating characteristic analysis.
Stat Methods Med Res. 2019 Oct-Nov;28(10-11):2975-2991. doi: 10.1177/0962280218795190. Epub 2018 Aug 20.

2. Stratification bias in low signal microarray studies.
BMC Bioinformatics. 2007 Sep 2;8:326. doi: 10.1186/1471-2105-8-326.

3. Issues in performance evaluation for host-pathogen protein interaction prediction.
J Bioinform Comput Biol. 2016 Jun;14(3):1650011. doi: 10.1142/S0219720016500116. Epub 2016 Jan 14.

