IEEE Trans Pattern Anal Mach Intell. 2023 Aug;45(8):9669-9680. doi: 10.1109/TPAMI.2023.3251957. Epub 2023 Jun 30.
Common cross-validation (CV) methods such as k-fold cross-validation or Monte Carlo cross-validation estimate the predictive performance of a learner by repeatedly training it on a large portion of the given data and testing it on the remaining data. These techniques have two major drawbacks. First, they can be unnecessarily slow on large datasets. Second, beyond an estimate of the final performance, they give almost no insight into the learning process of the validated algorithm. In this article, we present a new approach to validation based on learning curves (LCCV). Instead of creating train-test splits with a large portion of training data, LCCV iteratively increases the number of instances used for training. In the context of model selection, it discards models that are unlikely to become competitive. In a series of experiments on 75 datasets, we show that in over 90% of the cases, using LCCV leads to the same performance as 5/10-fold CV while substantially reducing runtime (median runtime reductions of over 50%); the performance of LCCV never deviated from that of CV by more than 2.5%. We also compare LCCV to a racing-based method and to successive halving, a multi-armed bandit method. Additionally, LCCV provides important insights, for example allowing an assessment of the benefits of acquiring more data.
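To make the core idea concrete, the following is a minimal sketch of learning-curve-based model selection, not the authors' implementation: each candidate model is evaluated at a growing sequence of training-set sizes ("anchors"), and candidates that fall clearly behind the incumbent best at an anchor are discarded before ever being trained on the full data. The anchor schedule and the fixed 0.05 pruning margin are illustrative assumptions; the actual LCCV method bases the pruning decision on the observed learning curves of the candidates.

```python
# Illustrative sketch of learning-curve-based model selection (hypothetical,
# simplified; the real LCCV prunes based on learning-curve behavior).
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def lccv_select(candidates, X, y, anchors=(64, 128, 256, 512), margin=0.05):
    """Evaluate candidates at increasing training sizes ('anchors') and drop
    any candidate whose score trails the incumbent best by more than `margin`
    at the same anchor. Returns the name of the surviving best candidate."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    alive = dict(candidates)
    scores = {}
    for n in anchors:
        n = min(n, len(X_tr))
        # train every surviving candidate on the first n instances only
        for name, model in list(alive.items()):
            fitted = clone(model).fit(X_tr[:n], y_tr[:n])
            scores[name] = fitted.score(X_te, y_te)
        best = max(scores[name] for name in alive)
        # prune candidates clearly behind the incumbent at this anchor
        alive = {name: m for name, m in alive.items()
                 if scores[name] >= best - margin}
    return max(alive, key=lambda name: scores[name])

X, y = make_classification(n_samples=2000, random_state=0)
winner = lccv_select({"logreg": LogisticRegression(max_iter=1000),
                      "knn": KNeighborsClassifier()}, X, y)
print("selected:", winner)
```

Because weak candidates are eliminated after cheap fits on small subsamples, most of the cost of full-size training is avoided, which is the source of the runtime reductions reported above; the per-anchor scores also trace out each model's learning curve, yielding the additional insight into whether more data would help.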