Wang Ruibo, Wang Yu, Li Jihong, Yang Xingli, Yang Jing
School of Software, Shanxi University, Taiyuan 030006, P.R.C.
School of Mathematical Sciences, Shanxi University, Taiyuan 030006, P.R.C.
Neural Comput. 2017 Feb;29(2):519-554. doi: 10.1162/NECO_a_00923. Epub 2016 Dec 28.
A cross-validation method based on m replications of two-fold cross validation is called an m×2 cross validation. An m×2 cross validation is used to estimate the generalization error and to compare algorithms' performance in machine learning. However, the variance of the estimator of the generalization error in m×2 cross validation is easily affected by random partitions. Poor data partitioning may cause a large fluctuation in the number of overlapping samples between any two training (test) sets in m×2 cross validation. This fluctuation results in a large variance in the m×2 cross-validated estimator. The influence of the random partitions on the variance becomes more serious as m increases. Thus, in this study, the partitions with a restricted number of overlapping samples between any two training (test) sets are defined as a block-regularized partition set. The corresponding cross validation is called block-regularized m×2 cross validation (m×2 BCV). It can effectively reduce the influence of random partitions. We prove that the variance of the m×2 BCV estimator of the generalization error is smaller than that of the m×2 cross-validated estimator and attains its minimum in a special situation, in which an analytical expression of the variance can also be derived. This conclusion is validated through simulation experiments. Furthermore, a practical method of constructing an m×2 BCV from a two-level orthogonal array is provided. Finally, a conservative estimator is proposed for the variance of the estimator of the generalization error.
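The construction idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: it uses the columns of a Sylvester-type Hadamard matrix as a two-level orthogonal array, splits the sample indices into equal blocks, and assigns whole blocks to the two folds of each replication. Column orthogonality then forces any two training sets to overlap in exactly n/4 samples, which is the balance that block regularization restricts. All function and variable names (`hadamard`, `balanced_partitions`, `k`) are illustrative assumptions, not identifiers from the paper.

```python
def hadamard(k):
    """Sylvester construction: a 2^k x 2^k matrix with entries +1/-1
    whose columns form a two-level orthogonal array."""
    H = [[1]]
    for _ in range(k):
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

def balanced_partitions(n, m, k=3):
    """Split indices 0..n-1 into 2^k equal blocks, then use m non-constant
    columns of the 2^k-run orthogonal array to assign blocks to the two
    folds of each of the m replications of two-fold cross validation."""
    runs = 2 ** k
    assert n % runs == 0 and 1 <= m <= runs - 1
    H = hadamard(k)
    block_size = n // runs
    blocks = [list(range(b * block_size, (b + 1) * block_size))
              for b in range(runs)]
    partitions = []
    for col in range(1, m + 1):  # column 0 is constant (+1), so skip it
        fold0 = [i for b in range(runs) if H[b][col] == 1 for i in blocks[b]]
        fold1 = [i for b in range(runs) if H[b][col] == -1 for i in blocks[b]]
        partitions.append((fold0, fold1))
    return partitions

# Any two distinct training sets (halves) share exactly n/4 samples --
# the fixed overlap that a block-regularized partition set enforces.
n, m = 32, 3
parts = balanced_partitions(n, m)
for i in range(m):
    assert len(parts[i][0]) == len(parts[i][1]) == n // 2
    for j in range(i + 1, m):
        assert len(set(parts[i][0]) & set(parts[j][0])) == n // 4
```

With random partitions, the pairwise training-set overlap fluctuates around n/4; fixing it at n/4, as above, removes that source of variance in the cross-validated estimator of the generalization error.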