Belabbas Mohamed-Ali, Wolfe Patrick J
Department of Statistics and School of Engineering and Applied Sciences, Harvard University, Oxford Street, Cambridge, MA 02138, USA.
Proc Natl Acad Sci U S A. 2009 Jan 13;106(2):369-74. doi: 10.1073/pnas.0810600105. Epub 2009 Jan 7.
Spectral methods are of fundamental importance in statistics and machine learning, because they underlie algorithms ranging from classical principal components analysis to more recent approaches that exploit manifold structure. In most cases, the core technical problem reduces to computing a low-rank approximation to a positive-semidefinite kernel. For the growing number of applications dealing with very large or high-dimensional datasets, however, the optimal approximation afforded by an exact spectral decomposition is too costly, because its complexity scales as the cube of either the number of training examples or their dimensionality. Motivated by such applications, we present here two new algorithms for the approximation of positive-semidefinite kernels, together with error bounds that improve on results in the literature. We approach this problem by seeking to determine, in an efficient manner, the most informative subset of our data relative to the kernel approximation task at hand. This leads to two new strategies based on the Nyström method that are directly applicable to massive datasets. The first, based on sampling, yields a randomized algorithm in which the kernel induces a probability distribution on its set of partitions; the second, based on sorting, selects a partition deterministically. We detail the numerical implementation of both strategies and provide simulation results for a variety of representative problems in statistical data analysis, each of which demonstrates the improved performance of our approach relative to existing methods.
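As a concrete illustration of the Nyström construction the abstract refers to, the following NumPy sketch approximates a positive-semidefinite kernel matrix from a small set of landmark columns. All function and variable names here are ours, and the two selection rules are simplified stand-ins for the paper's strategies: the sampling rule weights indices by the kernel diagonal rather than by the determinant-based distribution the paper induces from the kernel, and the sorting rule simply takes the largest diagonal entries, which may differ in detail from the paper's deterministic scheme.

    import numpy as np

    def nystrom(K, idx):
        # Nystrom reconstruction: K ~= C W^+ C^T, with C = K[:, idx] and
        # W the principal submatrix of K on the chosen landmark indices.
        C = K[:, idx]
        W = K[np.ix_(idx, idx)]
        return C @ np.linalg.pinv(W) @ C.T

    def select_by_sorting(K, k):
        # Deterministic choice: the k indices with the largest diagonal entries.
        return np.argsort(np.diag(K))[::-1][:k]

    def select_by_sampling(K, k, seed=None):
        # Randomized choice: sample k indices without replacement, weighted by
        # the diagonal of K (a simplified stand-in for the determinant-based
        # distribution that the paper derives from the kernel).
        rng = np.random.default_rng(seed)
        p = np.diag(K) / np.trace(K)
        return rng.choice(K.shape[0], size=k, replace=False, p=p)

    # Toy usage on a linear kernel, whose diagonal entries vary across points.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 10)) * rng.uniform(0.5, 2.0, size=(500, 1))
    K = X @ X.T  # positive semidefinite by construction

    for select in (select_by_sorting, select_by_sampling):
        idx = select(K, 25)
        err = np.linalg.norm(K - nystrom(K, idx)) / np.linalg.norm(K)
        print(f"{select.__name__}: relative Frobenius error {err:.3f}")

Because this toy Gram matrix has rank at most 10, either rule recovers it almost exactly with 25 landmarks; the choice of selection strategy matters more for kernels whose effective rank exceeds the landmark budget, which is the regime the paper's error bounds address.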