Distribution-Preserving Stratified Sampling for Learning Problems.

Authors

Cervellera Cristiano, Maccio Danilo

Publication information

IEEE Trans Neural Netw Learn Syst. 2018 Jul;29(7):2886-2895. doi: 10.1109/TNNLS.2017.2706964. Epub 2017 Jun 9.

DOI: 10.1109/TNNLS.2017.2706964
PMID: 28613186

Abstract

The need for extracting a small sample from a large amount of real data, possibly streaming, arises routinely in learning problems, e.g., for storage, to cope with computational limitations, obtain good training/test/validation sets, and select minibatches for stochastic gradient neural network training. Unless we have reasons to select the samples in an active way dictated by the specific task and/or model at hand, it is important that the distribution of the selected points is as similar as possible to the original data. This is obvious for unsupervised learning problems, where the goal is to gain insights on the distribution of the data, but it is also relevant for supervised problems, where the theory explains how the training set distribution influences the generalization error. In this paper, we analyze the technique of stratified sampling from the point of view of distances between probabilities. This allows us to introduce an algorithm, based on recursive binary partition of the input space, aimed at obtaining samples that are distributed as much as possible as the original data. A theoretical analysis is proposed, proving the (greedy) optimality of the procedure together with explicit error bounds. An adaptive version of the algorithm is also introduced to cope with streaming data. Simulation tests on various data sets and different learning tasks are also provided.
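
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of stratified sampling over a recursive binary partition of the input space. It is not the authors' algorithm. It assumes median cuts along the coordinate with the largest spread and one uniformly drawn point per cell; the paper's actual split criterion, its greedy optimality argument, its error bounds, and the adaptive streaming version are not reproduced. The names `partition_indices`, `stratified_sample`, and `depth` are hypothetical.

```python
import numpy as np


def partition_indices(points, indices, depth):
    """Recursively split `indices` into at most 2**depth cells by median cuts.

    Assumption: the cut axis is the coordinate with the largest range in the
    cell. This is an illustrative choice, not the rule used in the paper.
    """
    if depth == 0 or len(indices) < 2:
        return [indices]
    spread = np.ptp(points[indices], axis=0)  # per-coordinate range in this cell
    axis = int(np.argmax(spread))
    order = indices[np.argsort(points[indices, axis], kind="stable")]
    mid = len(order) // 2  # median cut: both halves are non-empty
    return (partition_indices(points, order[:mid], depth - 1)
            + partition_indices(points, order[mid:], depth - 1))


def stratified_sample(points, depth, rng):
    """Return the index of one uniformly chosen point from each cell."""
    cells = partition_indices(points, np.arange(len(points)), depth)
    return np.array([rng.choice(cell) for cell in cells])


# Usage: pick 2**5 = 32 points out of 1000 two-dimensional samples.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))
sample_idx = stratified_sample(data, depth=5, rng=rng)
sample = data[sample_idx]
```

Because every cell holds roughly the same number of original points, taking one point per cell spreads the sample over the data in proportion to its density, which is the intuition behind keeping the sample distribution close to the original one.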


Similar articles

1
Distribution-Preserving Stratified Sampling for Learning Problems.
IEEE Trans Neural Netw Learn Syst. 2018 Jul;29(7):2886-2895. doi: 10.1109/TNNLS.2017.2706964. Epub 2017 Jun 9.
2
On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning.
J Anal Test. 2018;2(3):249-262. doi: 10.1007/s41664-018-0068-2. Epub 2018 Oct 29.
3
Using unsupervised analysis to constrain generalization bounds for support vector classifiers.
IEEE Trans Neural Netw. 2010 Mar;21(3):424-38. doi: 10.1109/TNN.2009.2038695. Epub 2010 Jan 29.
4
An Efficient Sampling-Based Algorithms Using Active Learning and Manifold Learning for Multiple Unmanned Aerial Vehicle Task Allocation under Uncertainty.
Sensors (Basel). 2018 Aug 12;18(8):2645. doi: 10.3390/s18082645.
5
One-pass-throw-away learning for cybersecurity in streaming non-stationary environments by dynamic stratum network.
PLoS One. 2018 Sep 6;13(9):e0202937. doi: 10.1371/journal.pone.0202937. eCollection 2018.
6
Novel maximum-margin training algorithms for supervised neural networks.
IEEE Trans Neural Netw. 2010 Jun;21(6):972-84. doi: 10.1109/TNN.2010.2046423. Epub 2010 Apr 19.
7
A theory of local learning, the learning channel, and the optimality of backpropagation.
Neural Netw. 2016 Nov;83:51-74. doi: 10.1016/j.neunet.2016.07.006. Epub 2016 Aug 5.
8
Stability-Based Generalization Analysis of Distributed Learning Algorithms for Big Data.
IEEE Trans Neural Netw Learn Syst. 2020 Mar;31(3):801-812. doi: 10.1109/TNNLS.2019.2910188. Epub 2019 May 8.
9
Channel selection and classification of electroencephalogram signals: an artificial neural network and genetic algorithm-based approach.
Artif Intell Med. 2012 Jun;55(2):117-26. doi: 10.1016/j.artmed.2012.02.001. Epub 2012 Apr 12.
10
Visual Recognition by Learning From Web Data via Weakly Supervised Domain Generalization.
IEEE Trans Neural Netw Learn Syst. 2017 Sep;28(9):1985-1999. doi: 10.1109/TNNLS.2016.2557349. Epub 2016 Jun 1.

Cited by

1
A new ensemble learning method stratified sampling blending optimizes conventional blending and improves prediction performance.
Bioinform Adv. 2025 Feb 22;5(1):vbaf002. doi: 10.1093/bioadv/vbaf002. eCollection 2025.
2
External validation of the prediction model of intradialytic hypotension: a multicenter prospective cohort study.
Ren Fail. 2024 Dec;46(1):2322031. doi: 10.1080/0886022X.2024.2322031. Epub 2024 Mar 11.