Abhishek Kumar, Brown Colin J, Hamarneh Ghassan
School of Computing Science, Simon Fraser University, 8888 University Drive, Burnaby, V5A 1S6 Canada.
Engineering, Hinge Health, 455 Market Street, Suite 700, San Francisco, 94105 USA.
J Big Data. 2024;11(1):43. doi: 10.1186/s40537-024-00898-6. Epub 2024 Mar 23.
Modern deep learning training procedures rely on model regularization techniques such as data augmentation methods, which generate training samples that increase the diversity of data and richness of label information. A popular recent method, mixup, uses convex combinations of pairs of original samples to generate new samples. However, as we show in our experiments, mixup can produce undesirable synthetic samples, where the data is sampled off the data manifold and can contain incorrect labels. We propose ζ-mixup, a generalization of mixup with provably and demonstrably desirable properties that allows convex combinations of T ≥ 2 samples, leading to more realistic and diverse outputs that incorporate information from T original samples by using a p-series interpolant. We show that, compared to mixup, ζ-mixup better preserves the intrinsic dimensionality of the original datasets, which is a desirable property for training generalizable models. Furthermore, we show that our implementation of ζ-mixup is faster than mixup, and extensive evaluation on controlled synthetic and 26 diverse real-world natural and medical image classification datasets shows that ζ-mixup outperforms mixup, CutMix, and traditional data augmentation techniques. The code will be released at https://github.com/kakumarabhishek/zeta-mixup.
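For intuition, the sketch below illustrates the kind of p-series weighting the abstract describes: each synthetic sample is a convex combination of all T samples in a batch, with normalized weights proportional to k^(-γ), so most of the mass stays on a single dominant original sample and the output remains near the data manifold. This is a minimal NumPy sketch under stated assumptions; the function name zeta_mixup, the γ parameter and its default value, and the per-sample random permutation are illustrative choices, not the authors' released implementation (see the linked repository for that).

```python
import numpy as np

def zeta_mixup(X, Y, gamma=2.0, rng=None):
    """Sketch of p-series-weighted sample mixing (illustrative, not the official code).

    X: array of shape (T, ...) holding a batch of T >= 2 inputs.
    Y: array of shape (T, C) holding one-hot labels.
    gamma: decay exponent; larger values concentrate weight on the
           dominant sample, keeping outputs closer to an original point.
    """
    rng = np.random.default_rng(rng)
    T = X.shape[0]

    # Normalized p-series weights: w_k = k^(-gamma) / sum_j j^(-gamma).
    # The weights sum to 1, so each output is a convex combination.
    k = np.arange(1, T + 1, dtype=float)
    w = k ** (-gamma)
    w /= w.sum()

    X_flat = X.reshape(T, -1).astype(float)
    X_new = np.empty_like(X_flat)
    Y_new = np.empty_like(Y, dtype=float)
    for i in range(T):
        # A fresh random ordering per synthetic sample varies which
        # original sample receives the dominant weight.
        perm = rng.permutation(T)
        X_new[i] = w @ X_flat[perm]          # weighted sum of all T inputs
        Y_new[i] = w @ Y[perm].astype(float) # matching soft label
    return X_new.reshape(X.shape), Y_new

# Illustrative usage on a toy batch of 8 RGB images with 10 classes:
# X = np.random.rand(8, 3, 32, 32)
# Y = np.eye(10)[np.random.randint(0, 10, size=8)]
# X_mix, Y_mix = zeta_mixup(X, Y, gamma=2.0)
```

Because the weights decay rapidly with k, each synthetic sample stays in the neighborhood of its dominant original, which is consistent with the abstract's claim that ζ-mixup yields more realistic, on-manifold outputs than two-sample mixing.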