Cervellera Cristiano, Maccio Danilo
IEEE Trans Neural Netw Learn Syst. 2018 Jul;29(7):2886-2895. doi: 10.1109/TNNLS.2017.2706964. Epub 2017 Jun 9.
The need to extract a small sample from a large amount of real data, possibly streaming, arises routinely in learning problems, e.g., to reduce storage, cope with computational limitations, obtain good training/test/validation sets, and select minibatches for stochastic gradient neural network training. Unless we have reasons to select the samples in an active way dictated by the specific task and/or model at hand, it is important that the distribution of the selected points is as similar as possible to that of the original data. This is obvious for unsupervised learning problems, where the goal is to gain insight into the distribution of the data, but it is also relevant for supervised problems, where theory explains how the training set distribution influences the generalization error. In this paper, we analyze the technique of stratified sampling from the point of view of distances between probabilities. This allows us to introduce an algorithm, based on recursive binary partition of the input space, aimed at obtaining samples that are distributed as similarly as possible to the original data. A theoretical analysis is provided, proving the (greedy) optimality of the procedure together with explicit error bounds. An adaptive version of the algorithm is also introduced to cope with streaming data. Simulation tests on various data sets and different learning tasks are also provided.
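To illustrate the idea of stratified sampling via recursive binary partition of the input space, the sketch below splits the data along the widest coordinate at its median, allocates the sample budget to each half in proportion to its population, and draws uniformly within the leaf strata. This is a minimal illustrative sketch under assumed design choices (median splits, fixed depth), not the authors' exact procedure or error-bound machinery.

```python
import numpy as np

def stratified_sample(data, n_samples, depth=0, max_depth=4, rng=None):
    """Draw n_samples rows from data so that the sample follows the
    empirical distribution of the data across recursively built strata.

    Illustrative sketch only: the paper's algorithm uses a principled
    partition criterion and provides optimality guarantees; here we
    simply split at the median of the widest coordinate.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Leaf stratum: sample uniformly without replacement.
    if depth == max_depth or len(data) <= 1 or n_samples == 0:
        k = min(n_samples, len(data))
        if k == 0:
            return np.empty((0, data.shape[1]))
        idx = rng.choice(len(data), size=k, replace=False)
        return data[idx]
    # Split along the coordinate with the largest spread.
    dim = int(np.argmax(data.max(axis=0) - data.min(axis=0)))
    median = np.median(data[:, dim])
    left = data[data[:, dim] <= median]
    right = data[data[:, dim] > median]
    if len(left) == 0 or len(right) == 0:  # degenerate split: stop recursing
        idx = rng.choice(len(data), size=min(n_samples, len(data)), replace=False)
        return data[idx]
    # Budget proportional to stratum population (the stratified allocation).
    n_left = round(n_samples * len(left) / len(data))
    sample_left = stratified_sample(left, n_left, depth + 1, max_depth, rng)
    sample_right = stratified_sample(right, n_samples - n_left, depth + 1, max_depth, rng)
    return np.vstack([sample_left, sample_right])
```

For example, drawing 100 points from 1000 two-dimensional Gaussian samples preserves the per-stratum proportions by construction, whereas plain uniform subsampling only matches them in expectation.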