Maalouf Alaa, Jubran Ibrahim, Feldman Dan
IEEE Trans Pattern Anal Mach Intell. 2022 Dec;44(12):9977-9994. doi: 10.1109/TPAMI.2021.3139612. Epub 2022 Nov 7.
Least-mean-squares (LMS) solvers such as Linear/Ridge Regression and SVD not only solve fundamental machine learning problems, but are also the building blocks in a variety of other methods, such as matrix factorizations. We suggest an algorithm that gets a finite set of n d-dimensional real vectors and returns a subset of d+1 vectors with positive weights whose weighted sum is exactly the same as that of the input set. The constructive proof of Caratheodory's Theorem computes such a subset in O(n^2 d^2) time and is thus not used in practice. Our algorithm computes this subset in O(nd + d^4 log n) time, using O(log n) calls to Caratheodory's construction on small but "smart" subsets. This is based on a novel paradigm of fusing different data summarization techniques, known as sketches and coresets. For large values of d, we suggest a faster construction that takes O(nd) time and returns a weighted subset of O(d) sparsified input points; here, a sparsified point is one in which some entries were set to zero. As an application, we show how to boost the performance of existing LMS solvers, such as those in the scikit-learn library, by up to x100. Generalization to streaming and distributed data is straightforward. Extensive experimental results and open source code are provided.
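To make the abstract's core object concrete, the classical constructive proof of Caratheodory's Theorem can be sketched as follows. This is a naive NumPy implementation in the spirit of the O(n^2 d^2) construction mentioned above, not the paper's accelerated algorithm; the function name, null-space computation via SVD, and numerical tolerances are our own choices.

```python
import numpy as np

def caratheodory(P, u):
    """Naive Caratheodory construction: reduce n weighted points in R^d
    to at most d+1 points with positive weights and the same weighted sum.
    P: (n, d) array of points; u: (n,) positive weights summing to 1."""
    P = np.asarray(P, dtype=float)
    u = np.asarray(u, dtype=float)
    while len(u) > P.shape[1] + 1:
        # Any nonzero w in the null space of A = [p_2-p_1, ..., p_n-p_1]
        # yields v with sum(v) = 0 and sum(v_i * p_i) = 0.
        A = (P[1:] - P[0]).T                 # shape (d, n-1), n-1 > d
        _, _, Vt = np.linalg.svd(A)
        w = Vt[-1]                           # exact null-space direction
        v = np.concatenate(([-w.sum()], w))  # entries sum to zero
        # Largest step that keeps all weights nonnegative; at least one
        # weight is driven exactly to zero and its point is removed.
        pos = v > 1e-12
        alpha = np.min(u[pos] / v[pos])
        u = u - alpha * v
        keep = u > 1e-12
        P, u = P[keep], u[keep]
    return P, u
```

Each iteration removes at least one point while preserving both the total weight and the weighted sum, so the loop ends with at most d+1 points; the per-iteration null-space solve is what makes the classical construction quadratic in n and motivates the paper's faster variant.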