

Spectral methods in machine learning and new strategies for very large datasets.

Author information

Belabbas Mohamed-Ali, Wolfe Patrick J

Affiliation

Department of Statistics, School of Engineering and Applied Sciences, Harvard University, Oxford Street, Cambridge, MA 02138, USA.

Publication information

Proc Natl Acad Sci U S A. 2009 Jan 13;106(2):369-74. doi: 10.1073/pnas.0810600105. Epub 2009 Jan 7.

Abstract

Spectral methods are of fundamental importance in statistics and machine learning, because they underlie algorithms from classical principal components analysis to more recent approaches that exploit manifold structure. In most cases, the core technical problem can be reduced to computing a low-rank approximation to a positive-definite kernel. For the growing number of applications dealing with very large or high-dimensional datasets, however, the optimal approximation afforded by an exact spectral decomposition is too costly, because its complexity scales as the cube of either the number of training examples or their dimensionality. Motivated by such applications, we present here two new algorithms for the approximation of positive-semidefinite kernels, together with error bounds that improve on results in the literature. We approach this problem by seeking to determine, in an efficient manner, the most informative subset of our data relative to the kernel approximation task at hand. This leads to two new strategies based on the Nyström method that are directly applicable to massive datasets. The first of these, based on sampling, leads to a randomized algorithm whereupon the kernel induces a probability distribution on its set of partitions, whereas the latter approach, based on sorting, provides for the selection of a partition in a deterministic way. We detail their numerical implementation and provide simulation results for a variety of representative problems in statistical data analysis, each of which demonstrates the improved performance of our approach relative to existing methods.
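The Nyström method at the heart of both strategies approximates a positive-semidefinite kernel matrix from a small subset of its columns. A minimal NumPy sketch follows; note that the uniform landmark sampling used here is a simplified stand-in for the kernel-induced sampling distribution and the deterministic sorting criterion the paper actually proposes, and all function and variable names are illustrative.

```python
import numpy as np

def nystrom_approximation(K, idx):
    """Nystrom low-rank approximation of a PSD kernel matrix K.

    Given landmark indices idx, approximate K by
    C @ pinv(W) @ C.T, where C = K[:, idx] and W = K[idx][:, idx].
    """
    C = K[:, idx]                # sampled columns of the kernel
    W = K[np.ix_(idx, idx)]     # landmark-landmark block
    return C @ np.linalg.pinv(W) @ C.T

# Build a small PSD Gaussian kernel for demonstration.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists)

# Select 20 of 100 points as landmarks (uniformly, for illustration).
idx = rng.choice(100, size=20, replace=False)
K_hat = nystrom_approximation(K, idx)

# Relative Frobenius-norm approximation error.
err = np.linalg.norm(K - K_hat, "fro") / np.linalg.norm(K, "fro")
```

The cost is dominated by the pseudoinverse of the small landmark block, so for n points and k landmarks the approximation runs in O(nk² + k³) rather than the O(n³) of an exact eigendecomposition; the quality of `K_hat` depends entirely on how informative the chosen subset is, which is the selection problem the two proposed strategies address.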


Similar articles

1. On landmark selection and sampling in high-dimensional data analysis.
Philos Trans A Math Phys Eng Sci. 2009 Nov 13;367(1906):4295-312. doi: 10.1098/rsta.2009.0161.
2. Kernel entropy component analysis.
IEEE Trans Pattern Anal Mach Intell. 2010 May;32(5):847-60. doi: 10.1109/TPAMI.2009.100.
3. Reduced support vector machines: a statistical theory.
IEEE Trans Neural Netw. 2007 Jan;18(1):1-13. doi: 10.1109/TNN.2006.883722.
4. Randomized algorithms for large-scale dictionary learning.
Neural Netw. 2024 Nov;179:106628. doi: 10.1016/j.neunet.2024.106628. Epub 2024 Aug 10.

References cited in this article

1. Randomized algorithms for the low-rank approximation of matrices.
Proc Natl Acad Sci U S A. 2007 Dec 18;104(51):20167-72. doi: 10.1073/pnas.0709640104. Epub 2007 Dec 4.
2. Hessian eigenmaps: locally linear embedding techniques for high-dimensional data.
Proc Natl Acad Sci U S A. 2003 May 13;100(10):5591-6. doi: 10.1073/pnas.1031596100. Epub 2003 Apr 30.
3. Spectral grouping using the Nyström method.
IEEE Trans Pattern Anal Mach Intell. 2004 Feb;26(2):214-25. doi: 10.1109/TPAMI.2004.1262185.
