Fast Multiscale Neighbor Embedding.

Authors

de Bodt Cyril, Mulders Dounia, Verleysen Michel, Lee John Aldo

Publication

IEEE Trans Neural Netw Learn Syst. 2022 Apr;33(4):1546-1560. doi: 10.1109/TNNLS.2020.3042807. Epub 2022 Apr 4.

Abstract

Dimension reduction (DR) computes faithful low-dimensional (LD) representations of high-dimensional (HD) data. Outstanding performances are achieved by recent neighbor embedding (NE) algorithms such as t-SNE, which mitigate the curse of dimensionality. The single-scale or multiscale nature of NE schemes drives the HD neighborhood preservation in the LD space (LDS). While single-scale methods focus on single-sized neighborhoods through the concept of perplexity, multiscale ones preserve neighborhoods across a broader range of sizes and account for the global HD organization to define the LDS. For both single-scale and multiscale methods, however, the time complexity in the number of samples is unaffordable for big data sets. Single-scale methods can be accelerated by exploiting the inherent sparsity of the HD similarities they involve. On the other hand, the dense structure of the multiscale HD similarities prevents developing fast multiscale schemes in a similar way. This article addresses this difficulty by designing randomized accelerations of the multiscale methods. To account for all levels of interaction, the HD data are first subsampled at different scales, which makes it possible to identify small and relevant neighbor sets for each data point thanks to vantage-point trees. Afterward, these sets are employed with a Barnes-Hut algorithm to cheaply evaluate the considered cost function and its gradient, enabling large-scale use of multiscale NE schemes. Extensive experiments demonstrate that the proposed accelerations are, statistically significantly, both faster than the original multiscale methods by orders of magnitude and better at preserving the HD neighborhoods than state-of-the-art single-scale schemes, leading to high-quality LD embeddings. Public codes are freely available at https://github.com/cdebodt.
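The abstract's acceleration pipeline relies on vantage-point (VP) trees to retrieve a small, relevant neighbor set for each data point. As a minimal sketch of that data structure only (not the authors' implementation; the class name `VPTree`, the Euclidean metric, and the median split are assumptions for illustration), a VP tree with a triangle-inequality-pruned nearest-neighbor query can look like this:

```python
# Minimal vantage-point tree sketch: build by recursive median splits
# around a random vantage point, then answer nearest-neighbor queries
# with triangle-inequality pruning.
import math
import random


class VPTree:
    def __init__(self, points, dist=None):
        self.dist = dist or (lambda a, b: math.dist(a, b))
        self.root = self._build(list(points))

    def _build(self, pts):
        if not pts:
            return None
        # Pick a random vantage point and split the rest at the median
        # distance mu: "inner" ball (d < mu) vs. "outer" shell (d >= mu).
        vp = pts.pop(random.randrange(len(pts)))
        if not pts:
            return (vp, 0.0, None, None)
        ds = [self.dist(vp, p) for p in pts]
        mu = sorted(ds)[len(ds) // 2]
        inner = [p for p, d in zip(pts, ds) if d < mu]
        outer = [p for p, d in zip(pts, ds) if d >= mu]
        return (vp, mu, self._build(inner), self._build(outer))

    def nearest(self, q):
        best = [None, float("inf")]  # (best point, best distance)

        def search(node):
            if node is None:
                return
            vp, mu, inner, outer = node
            d = self.dist(q, vp)
            if d < best[1]:
                best[0], best[1] = vp, d
            # Visit the likely side first; visit the other side only if
            # the ball of radius best[1] around q can still cross the
            # splitting sphere of radius mu around vp.
            if d < mu:
                search(inner)
                if d + best[1] >= mu:
                    search(outer)
            else:
                search(outer)
                if d - best[1] <= mu:
                    search(inner)

        search(self.root)
        return best[0], best[1]
```

The pruning is what keeps queries cheap: whole subtrees are skipped whenever the current best ball cannot intersect their region, which is how small neighbor sets can be extracted without scanning all points.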
