Edoardo Vecchi, Lukáš Pospíšil, Steffen Albrecht, Terence J. O'Kane, Illia Horenko
Università della Svizzera italiana, Faculty of Informatics, TI-6900 Lugano, Switzerland
VSB-Technical University of Ostrava, Department of Mathematics, Ludvíka Podéště 1875/17, 708 33 Ostrava, Czech Republic
Neural Comput. 2022 Apr 15;34(5):1220-1255. doi: 10.1162/neco_a_01490.
Classification problems in the small data regime (with a small number of data points T and a relatively large feature space dimension D) pose challenges for common machine learning (ML) and deep learning (DL) tools. Standard learning methods from these areas tend to lack robustness when applied to data sets with significantly fewer data points than dimensions and quickly reach the overfitting bound, leading to poor performance beyond the training set. To tackle this issue, we propose eSPA+, a significant extension of the recently formulated entropy-optimal scalable probabilistic approximation algorithm (eSPA). Specifically, we change the order of the optimization steps and replace the most computationally expensive subproblem of eSPA with its closed-form solution. We prove that with these two enhancements, eSPA+ moves from the polynomial to the linear complexity class of scaling algorithms. On several small data learning benchmarks, we show that eSPA+ achieves a many-fold speedup over eSPA and even better performance than a wide array of ML and DL tools. In particular, we benchmark eSPA+ against standard eSPA and the main classes of common learning algorithms for the small data regime: various forms of support vector machines, random forests, and long short-term memory algorithms. In all the considered applications, the common learning methods and eSPA are markedly outperformed by eSPA+, which achieves significantly higher prediction accuracy at an orders-of-magnitude lower computational cost.
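The abstract describes the method only at a high level (alternating optimization over a discretization, entropic feature weights, and a probabilistic classifier, with the expensive subproblem replaced by a closed-form update). The following is a minimal NumPy sketch of what such an alternating scheme with closed-form subproblem solutions might look like; the exact loss, the update formulas, and the names espa_like_fit, eps_W, eps_CL, and K are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def espa_like_fit(X, y, K=4, eps_W=1e-2, eps_CL=1.0, n_iter=30, seed=0):
        # X: (T, D) feature matrix, y: (T,) integer class labels in {0, ..., M-1}.
        rng = np.random.default_rng(seed)
        T, D = X.shape
        M = int(y.max()) + 1
        Pi = np.eye(M)[y]                              # (T, M) one-hot labels
        gamma = np.eye(K)[rng.integers(0, K, size=T)]  # (T, K) random box affiliations
        W = np.full(D, 1.0 / D)                        # feature weights on the simplex
        for _ in range(n_iter):
            # Box centroids (closed form): weighted means of the affiliated points.
            counts = gamma.sum(axis=0) + 1e-12
            S = (gamma.T @ X) / counts[:, None]        # (K, D)
            # Conditional label probabilities per box (closed form).
            Lam = gamma.T @ Pi + 1e-12
            Lam /= Lam.sum(axis=1, keepdims=True)      # (K, M)
            # Feature weights (closed form under entropic regularization):
            # a softmax of the negative per-feature discretization error.
            sq = (X[:, None, :] - S[None, :, :]) ** 2  # (T, K, D)
            resid = np.einsum('tk,tkd->d', gamma, sq)
            logW = -resid / eps_W
            W = np.exp(logW - logW.max())
            W /= W.sum()
            # Affiliations: each point picks the box minimizing the weighted
            # discretization error minus eps_CL times the label log-likelihood.
            cost = sq @ W - eps_CL * (Pi @ np.log(Lam).T)  # (T, K)
            gamma = np.eye(K)[cost.argmin(axis=1)]
        return S, Lam, W

    # Toy usage in the small data regime: T = 40 points, D = 200 features.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(40, 200))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    S, Lam, W = espa_like_fit(X, y)
    # Predict: nearest box under the learned weighted metric, then its label mode.
    k_star = (((X[:, None, :] - S[None, :, :]) ** 2) @ W).argmin(axis=1)
    print((Lam[k_star].argmax(axis=1) == y).mean())    # training accuracy

Every update in the loop is a closed-form expression in the other variables, which is what keeps the per-iteration cost linear in T, D, and K; the abstract's claim that replacing the most expensive subproblem with its closed-form solution moves eSPA+ into the linear complexity class is consistent with this structure.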