Yu Ruonan, Liu Songhua, Wang Xinchao
IEEE Trans Pattern Anal Mach Intell. 2024 Jan;46(1):150-170. doi: 10.1109/TPAMI.2023.3323376. Epub 2023 Dec 5.
Recent success of deep learning is largely attributed to the sheer amount of data used to train deep neural networks. Despite this unprecedented success, such massive data significantly increases the burden of storage and transmission and makes the model-training process cumbersome. Moreover, relying on the raw data for training raises concerns about privacy and copyright. To alleviate these shortcomings, dataset distillation (DD), also known as dataset condensation (DC), was introduced and has recently attracted much research attention in the community. Given an original dataset, DD aims to derive a much smaller dataset of synthetic samples such that models trained on it achieve performance comparable to models trained on the original dataset. In this paper, we give a comprehensive review and summary of recent advances in DD and its applications. We first introduce the task formally and propose an overall algorithmic framework that all existing DD methods follow. Next, we provide a systematic taxonomy of current methodologies in this area and discuss their theoretical interconnections. We also present current challenges in DD through extensive empirical studies and envision possible directions for future work.
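For concreteness, the DD objective described in the abstract is commonly cast as a bi-level optimization problem. The LaTeX fragment below is a minimal sketch under assumed notation (\mathcal{T} for the original dataset, \mathcal{S} for the synthetic one, \mathrm{alg} for a training routine); it illustrates the general idea rather than reproducing the survey's exact formulation.

% Minimal sketch of the bi-level DD objective; all symbols
% (\mathcal{T}, \mathcal{S}, \theta^{(0)}, \mathrm{alg}, P_{\theta})
% are assumed notation for illustration, not taken from the survey.
\begin{align*}
  \mathcal{S}^{*} &= \operatorname*{arg\,min}_{\mathcal{S},\; |\mathcal{S}| \ll |\mathcal{T}|}
      \; \mathbb{E}_{\theta^{(0)} \sim P_{\theta}}
      \Big[ \mathcal{L}\big(\mathcal{T};\, \theta_{\mathcal{S}}\big) \Big]
  \quad \text{s.t.} \quad
  \theta_{\mathcal{S}} = \mathrm{alg}\big(\mathcal{S},\, \theta^{(0)}\big),
\end{align*}

where \mathcal{L}(\mathcal{T}; \theta) is the loss of a model with parameters \theta evaluated on the original dataset \mathcal{T}, and \mathrm{alg}(\mathcal{S}, \theta^{(0)}) trains a network on the synthetic set \mathcal{S} from a random initialization \theta^{(0)} drawn from a distribution P_{\theta}. This corresponds to the performance-matching view of DD; other families of methods replace the outer objective with surrogates such as gradient, trajectory, or distribution matching.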