K R Madhavan, Kurban Hasan, Kulekci Oguzhan M, Dalkilic Mehmet M
Computer Science Department, Indiana University, Bloomington, Indiana, IN, USA.
College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar.
Sci Rep. 2025 Jul 9;15(1):24788. doi: 10.1038/s41598-025-09856-5.
When k-Nearest-Neighbors ([Formula: see text]-NN) was conceived more than 70 years ago, computation, as we use it now, would be hardly recognizable. Since then, technology has improved by orders of magnitude, including unprecedented connectivity. However, [Formula: see text]-NN has remained virtually unchanged, exposing its shortcomings for today's needs: becoming overwhelmed when presented with large, high-dimensional data. Although space partitioning data structures, especially k-d trees and ball-trees, have improved performance in larger data, they remain inadequate when data is also high-dimensional. Experiments confirm that space partitioning becomes ineffective in high-dimensional data because most of the search space is explored needlessly. Our strategy is to partition the data into small groups of points similarly distanced from a reference point in a B+ tree data structure and use this data structure to limit the search space of a [Formula: see text]-NN query. Further, we establish that the limited search space chosen by the B+ tree structure can be effectively explored by any indexing techniques applicable to the entire data. We then present our algorithm [Formula: see text]-NN with partitioning (ti[Formula: see text]-NN), including computational analysis and experiments. Our detailed evaluation demonstrates significant speedup achieved by ti[Formula: see text]-NN over the naive, [Formula: see text]-d tree, [Formula: see text]-tree based [Formula: see text]-NN and other state-of-the-art approximate [Formula: see text]-NN search approaches in high dimensional data.
70多年前提出k近邻算法(k-Nearest-Neighbors,k-NN)时,我们现在所使用的计算方式几乎无法想象。从那时起,技术已经有了数量级的提升,包括前所未有的连接性。然而,k-NN算法几乎没有什么变化,暴露出其在满足当今需求方面的不足:面对大规模的高维数据时会不堪重负。尽管空间划分数据结构,特别是k-d树和球树,在处理更大规模数据时提高了性能,但在数据也是高维的情况下仍然不够。实验证实,在高维数据中空间划分变得无效,因为大部分搜索空间都被不必要地探索了。我们的策略是在B+树数据结构中将数据划分为与参考点距离相近的小数据点组,并使用这种数据结构来限制k-NN查询的搜索空间。此外,我们确定B+树结构选择的有限搜索空间可以通过适用于整个数据的任何索引技术有效地探索。然后我们提出了带划分的k-NN算法(ti k-NN),包括计算分析和实验。我们的详细评估表明,在高维数据中,ti k-NN相对于朴素的、基于k-d树、球树的k-NN以及其他最先进的近似k-NN搜索方法实现了显著的加速。