


Dataset Distillation: A Comprehensive Review.

Authors

Yu Ruonan, Liu Songhua, Wang Xinchao

Publication

IEEE Trans Pattern Anal Mach Intell. 2024 Jan;46(1):150-170. doi: 10.1109/TPAMI.2023.3323376. Epub 2023 Dec 5.

DOI: 10.1109/TPAMI.2023.3323376
PMID: 37815974
Abstract

Recent success of deep learning is largely attributed to the sheer amount of data used for training deep neural networks. Despite the unprecedented success, the massive data, unfortunately, significantly increases the burden on storage and transmission and further gives rise to a cumbersome model training process. Besides, relying on the raw data for training per se yields concerns about privacy and copyright. To alleviate these shortcomings, dataset distillation (DD), also known as dataset condensation (DC), was introduced and has recently attracted much research attention in the community. Given an original dataset, DD aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset. In this paper, we give a comprehensive review and summary of recent advances in DD and its application. We first introduce the task formally and propose an overall algorithmic framework followed by all existing DD methods. Next, we provide a systematic taxonomy of current methodologies in this area, and discuss their theoretical interconnections. We also present current challenges in DD through extensive empirical studies and envision possible directions for future works.

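The abstract frames DD as learning a much smaller synthetic set such that a model trained on it performs comparably to one trained on the full data. As a rough illustration of one family of methods covered by such surveys (gradient matching), here is a minimal PyTorch sketch; the toy dataset, network, loop counts, and learning rate are all placeholder assumptions, not details taken from the paper.

```python
# Minimal gradient-matching sketch of dataset distillation (illustrative
# only; not the specific framework of Yu et al.). Published methods also
# loop over many network initializations, apply data augmentation, and
# periodically train the network itself. Every name and hyperparameter
# below is a placeholder assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy "real" dataset: 1000 samples, 20 features, 10 classes.
X_real = torch.randn(1000, 20)
y_real = torch.randint(0, 10, (1000,))

# Learnable synthetic set: one sample per class (the "distilled" data).
X_syn = torch.randn(10, 20, requires_grad=True)
y_syn = torch.arange(10)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
opt_syn = torch.optim.Adam([X_syn], lr=1e-2)

def param_grads(X, y):
    """Gradients of the classification loss w.r.t. the model parameters."""
    loss = F.cross_entropy(model(X), y)
    return torch.autograd.grad(loss, model.parameters(), create_graph=True)

for step in range(200):
    # Make the gradients induced by the synthetic set mimic those
    # induced by a random batch of real data.
    idx = torch.randint(0, len(X_real), (128,))
    g_real = param_grads(X_real[idx], y_real[idx])
    g_syn = param_grads(X_syn, y_syn)
    match = sum(F.mse_loss(gs, gr.detach()) for gs, gr in zip(g_syn, g_real))
    opt_syn.zero_grad()
    match.backward()
    opt_syn.step()
```

In a full pipeline, a fresh network would then be trained only on X_syn and evaluated on held-out real data; the distilled set is judged by how closely that model approaches full-data accuracy.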

Similar Articles

1
Dataset Distillation: A Comprehensive Review.
IEEE Trans Pattern Anal Mach Intell. 2024 Jan;46(1):150-170. doi: 10.1109/TPAMI.2023.3323376. Epub 2023 Dec 5.
2
Importance-aware adaptive dataset distillation.
Neural Netw. 2024 Apr;172:106154. doi: 10.1016/j.neunet.2024.106154. Epub 2024 Jan 29.
3
A Comprehensive Survey of Dataset Distillation.
IEEE Trans Pattern Anal Mach Intell. 2024 Jan;46(1):17-32. doi: 10.1109/TPAMI.2023.3322540. Epub 2023 Dec 5.
4
Dataset Condensation via Expert Subspace Projection.
Sensors (Basel). 2023 Sep 28;23(19):8148. doi: 10.3390/s23198148.
5
Mitigating carbon footprint for knowledge distillation based deep learning model compression.
PLoS One. 2023 May 15;18(5):e0285668. doi: 10.1371/journal.pone.0285668. eCollection 2023.
6
TEM virus images: Benchmark dataset and deep learning classification.
Comput Methods Programs Biomed. 2021 Sep;209:106318. doi: 10.1016/j.cmpb.2021.106318. Epub 2021 Jul 29.
7
Compressed gastric image generation based on soft-label dataset distillation for medical data sharing.
Comput Methods Programs Biomed. 2022 Dec;227:107189. doi: 10.1016/j.cmpb.2022.107189. Epub 2022 Oct 22.
8
Continual learning with attentive recurrent neural networks for temporal data classification.
Neural Netw. 2023 Jan;158:171-187. doi: 10.1016/j.neunet.2022.10.031. Epub 2022 Nov 11.
9
Self-supervised learning with self-distillation on COVID-19 medical image classification.
Comput Methods Programs Biomed. 2024 Jan;243:107876. doi: 10.1016/j.cmpb.2023.107876. Epub 2023 Oct 18.

Cited By

1
A decentralised architecture for secure exchange of assets in data spaces: The case of SEDIMARK.
Data Brief. 2025 Jun 10;61:111757. doi: 10.1016/j.dib.2025.111757. eCollection 2025 Aug.
2
Concurrent photocatalytic degradation of organic pollutants using smart magnetically cellulose-based metal organic framework nanocomposite.
Sci Rep. 2025 Jun 20;15(1):20100. doi: 10.1038/s41598-025-03256-5.
3
A new dataset for measuring the performance of blood vessel segmentation methods under distribution shifts.
PLoS One. 2025 May 27;20(5):e0322048. doi: 10.1371/journal.pone.0322048. eCollection 2025.
4
Condensation of Data and Knowledge for Network Traffic Classification: Techniques, Applications, and Open Issues.
Sensors (Basel). 2025 Apr 8;25(8):2368. doi: 10.3390/s25082368.
5
Model interpretability on private-safe oriented student dropout prediction.
PLoS One. 2025 Mar 31;20(3):e0317726. doi: 10.1371/journal.pone.0317726. eCollection 2025.
6
Deep-learning-ready RGB-depth images of seedling development.
Plant Methods. 2025 Feb 11;21(1):16. doi: 10.1186/s13007-025-01334-3.
7
Data free knowledge distillation with feature synthesis and spatial consistency for image analysis.
Sci Rep. 2024 Nov 11;14(1):27557. doi: 10.1038/s41598-024-78757-w.
8
Multi-Source Feature-Fusion Method for the Seismic Data of Cultural Relics Based on Deep Learning.
Sensors (Basel). 2024 Jul 12;24(14):4525. doi: 10.3390/s24144525.
9
Data Valuation with Gradient Similarity.
ArXiv. 2024 May 13:arXiv:2405.08217v1.
10
Machine Learning-Guided Protein Engineering.
ACS Catal. 2023 Oct 13;13(21):13863-13895. doi: 10.1021/acscatal.3c02743. eCollection 2023 Nov 3.