Wang Yulin, Yue Yang, Lu Rui, Han Yizeng, Song Shiji, Huang Gao
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8036-8055. doi: 10.1109/TPAMI.2024.3401036. Epub 2024 Nov 6.
The superior performance of modern computer vision backbones (e.g., vision Transformers trained on ImageNet-1K/22K) usually comes with a costly training procedure. This study addresses this issue by generalizing the idea of curriculum learning beyond its original formulation, i.e., training models using easier-to-harder data. Specifically, we reformulate the training curriculum as a soft-selection function that uncovers progressively more difficult patterns within each example during training, instead of performing easier-to-harder sample selection. Our work is inspired by an intriguing observation on the learning dynamics of visual backbones: during the earlier stages of training, the model predominantly learns to recognize certain 'easier-to-learn' discriminative patterns in the data. Viewed in the frequency and spatial domains, these patterns consist of lower-frequency components and of natural image content free of distortion or data augmentation. Motivated by these findings, we propose a curriculum in which the model always leverages all the training data at every learning stage, yet is exposed first to the 'easier-to-learn' patterns of each example, with harder patterns gradually introduced as training progresses. To implement this idea in a computationally efficient way, we introduce a cropping operation in the Fourier spectrum of the inputs, enabling the model to learn from only the lower-frequency components. We then show that exposing the natural content of images can be readily achieved by modulating the intensity of data augmentation. Finally, we integrate these two aspects and design curriculum learning schedules by proposing tailored search algorithms. Moreover, we present useful techniques for deploying our approach efficiently in challenging practical scenarios, such as large-scale parallel training and limited input/output or data pre-processing speed. The resulting method, EfficientTrain++, is simple, general, yet surprisingly effective. As an off-the-shelf approach, it reduces the training time of various popular models (e.g., ResNet, ConvNeXt, DeiT, PVT, Swin, CSWin, and CAFormer) by [Formula: see text] on ImageNet-1K/22K without sacrificing accuracy. It also demonstrates efficacy in self-supervised learning (e.g., MAE).
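The low-frequency cropping described in the abstract can be illustrated with a short sketch. The following is a minimal PyTorch illustration, not the paper's released implementation: the function name `low_frequency_crop` and the intensity rescaling are assumptions made for exposition. The idea is to keep only a central band-by-band window of the centered Fourier spectrum and map it back to a smaller image that contains only the lower-frequency content.

```python
import torch

def low_frequency_crop(x: torch.Tensor, band: int) -> torch.Tensor:
    """Keep a central band x band window of the 2D Fourier spectrum of a
    batch of images x with shape (N, C, H, W), then map it back to pixel
    space. The output is a band x band image holding only low frequencies."""
    h, w = x.shape[-2:]
    spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))  # DC at center
    top, left = (h - band) // 2, (w - band) // 2
    cropped = spec[..., top:top + band, left:left + band]       # crop spectrum
    img = torch.fft.ifft2(torch.fft.ifftshift(cropped, dim=(-2, -1)))
    # Rescale so mean intensity survives the size change (ifft2 normalizes
    # by band*band rather than h*w). This detail is an assumption.
    return img.real * (band * band) / (h * w)
```

Note that because the output has only band x band pixels, early training stages that use a small band also process smaller inputs, which is consistent with where the reported training-time savings come from.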
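The second ingredient, modulating the intensity of data augmentation over training, can likewise be sketched as a schedule over a standard augmentation policy. The linear ramp and the helper name below are illustrative assumptions, with torchvision's RandAugment standing in for whichever policy is actually scheduled:

```python
from torchvision.transforms import RandAugment

def augmentation_for_epoch(epoch: int, total_epochs: int,
                           max_magnitude: int = 9) -> RandAugment:
    # Hypothetical linear ramp: weak augmentation early in training
    # (near-natural, undistorted images), full strength by the end.
    frac = epoch / max(total_epochs - 1, 1)
    return RandAugment(magnitude=round(max_magnitude * frac))
```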