Austin George I, Pe'er Itsik, Korem Tal
ArXiv. 2025 Mar 24:arXiv:2406.01652v2.
Cross-validation is a common method for estimating the predictive performance of machine learning models. In data-scarce regimes, where one typically wishes to maximize the number of instances used for training, an approach called "leave-one-out cross-validation" (LOOCV) is often used. In this design, a separate model is trained on all other instances to predict each held-out data instance. Since this yields only a single test instance per trained model, predictions are aggregated across the entire dataset to compute common performance metrics such as the area under the receiver operating characteristic curve (AUROC) or the R² score. In this work, we demonstrate that this approach induces a negative correlation between the average label of each training fold and the label of its corresponding test instance, a phenomenon we term distributional bias. Because machine learning models tend to regress to the mean of their training data, this distributional bias tends to degrade performance evaluation and hyperparameter optimization. We show that this effect generalizes to leave-P-out cross-validation, persists across a wide range of modeling and evaluation approaches, and can bias model selection against stronger regularization. To address this, we propose a generalizable rebalanced cross-validation approach that corrects for distributional bias in both classification and regression. We demonstrate that our approach improves cross-validation performance evaluation in synthetic simulations, on machine learning benchmarks, and in several published leave-one-out analyses.
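To make the mechanism concrete, the following minimal Python sketch (ours, not from the paper) illustrates distributional bias with pure-noise labels and the simplest possible model, a constant predictor equal to the training-fold mean; the variable names and the choice of n = 30 are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n = 30
y = rng.normal(size=n)  # labels only; any features would carry no signal

# In leave-one-out CV, the best constant model predicts the mean of the
# n - 1 training labels, which for held-out instance i equals
# (n * mean(y) - y_i) / (n - 1): a strictly decreasing function of y_i.
preds = (n * y.mean() - y) / (n - 1)

# The aggregated predictions are therefore perfectly anti-correlated
# with the held-out labels (distributional bias).
print(np.corrcoef(preds, y)[0, 1])  # -1.0 exactly

# Pooled leave-one-out R^2 of the constant model is 1 - (n / (n - 1))^2,
# i.e. about -0.07 for n = 30, even though the same model would score
# exactly 0 when evaluated in-sample.
ss_res = np.sum((y - preds) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print(1.0 - ss_res / ss_tot)

A learned model that regresses only partway toward its training mean inherits a milder version of this anti-correlation, which is why pooled metrics such as AUROC or R² computed over leave-one-out predictions can be pessimistic even when the model itself is sound.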