Bates Stephen, Hastie Trevor, Tibshirani Robert
Depts. of Statistics and EECS, Univ. of California, Berkeley.
Depts. of Statistics and Biomedical Data Science, Stanford Univ.
J Am Stat Assoc. 2024;119(546):1434-1445. doi: 10.1080/01621459.2023.2197686. Epub 2023 May 15.
Cross-validation is a widely used technique to estimate prediction error, but its behavior is complex and not fully understood. Ideally, one would like to think that cross-validation estimates the prediction error for the model at hand, fit to the training data. We prove that this is not the case for the linear model fit by ordinary least squares; rather, it estimates the average prediction error of models fit on other unseen training sets drawn from the same population. We further show that this phenomenon occurs for most popular estimates of prediction error, including data splitting, bootstrapping, and Mallows' Cp. Next, the standard confidence intervals for prediction error derived from cross-validation may have coverage far below the desired level. Because each data point is used for both training and testing, there are correlations among the measured accuracies for each fold, and so the usual estimate of variance is too small. We introduce a nested cross-validation scheme to estimate this variance more accurately, and show empirically that this modification leads to intervals with approximately correct coverage in many examples where traditional cross-validation intervals fail. Lastly, our analysis also shows that when producing confidence intervals for prediction accuracy with simple data splitting, one should not re-fit the model on the combined data, since this invalidates the confidence intervals.
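As an illustration of the "standard" interval the abstract refers to, the following is a minimal sketch, assuming squared-error loss, ordinary least squares, and a normal approximation. The function name `naive_cv_interval`, the simulated data, and all parameter choices are hypothetical and not taken from the paper. The sketch treats the per-point held-out errors as if they were independent, which is the assumption the abstract identifies as making the variance estimate too small. It does not implement the authors' nested cross-validation scheme.

```python
import numpy as np


def naive_cv_interval(X, y, n_folds=10, z=1.96, seed=0):
    """Naive K-fold CV estimate of squared prediction error for OLS,
    with a normal-approximation interval that ignores the correlation
    between held-out errors (hypothetical helper, for illustration only)."""
    n = len(y)
    rng = np.random.default_rng(seed)
    # Randomly partition the row indices into n_folds roughly equal folds.
    folds = np.array_split(rng.permutation(n), n_folds)
    errors = np.empty(n)
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        # Fit OLS on the training folds only.
        beta, *_ = np.linalg.lstsq(X[train_mask], y[train_mask], rcond=None)
        # Record the squared error of each held-out point.
        errors[test_idx] = (y[test_idx] - X[test_idx] @ beta) ** 2
    estimate = errors.mean()
    # Standard error computed as if the n errors were independent.
    se = errors.std(ddof=1) / np.sqrt(n)
    return estimate, (estimate - z * se, estimate + z * se)


# Simulated example (assumed setup): n = 200 points, p = 5 features, unit noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.ones(5) + rng.normal(size=200)
est, (lo, hi) = naive_cv_interval(X, y)
print(f"CV estimate: {est:.3f}, naive 95% interval: ({lo:.3f}, {hi:.3f})")
```

Because every point serves as training data for the other folds, the held-out errors are not independent, so an interval built this way can be narrower than its nominal level suggests.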