Guzun Gheorghi, Canahuate Guadalupe
Department of Computer Engineering, San Jose State University, San Jose, CA, USA.
Department of Electrical and Computer Engineering, The University of Iowa, Iowa, IA, USA.
Distrib Parallel Databases. 2020;38:255-286. doi: 10.1007/s10619-019-07266-x. Epub 2019 Apr 11.
The concept of similarity is used as the basis for many data exploration and data mining tasks. Nearest neighbor (NN) queries identify the most similar items, or in terms of distance the closest points to a query point. Similarity is traditionally characterized using a distance function between multi-dimensional feature vectors. However, when the data is high-dimensional, traditional distance functions fail to significantly distinguish between the closest and furthest points, as few dissimilar dimensions dominate the distance function. Localized similarity functions, i.e. functions that only consider dimensions close to the query, quantize each dimension independently and only compute similarity for the dimensions where the query and the points fall into the same bin. These quantizations are query-agnostic and there is potential to improve accuracy when a query-dependent quantization is used. In this work we propose a query dependent equi-depth (QED) on-the-fly quantization method to improve high-dimensional similarity searches. The quantization is done for each dimension at query time and localized scores are generated for the closest fraction of the points while a constant penalty is applied for the rest of the points. QED not only improves the quality of the distancemetric,but also improves query time performance by filtering out non relevant data. We propose a distributed indexing and query algorithm to efficiently compute QED. Our experimental results show improvements in classification accuracy as well as query performance up to one order of magnitude faster than Manhattan-based sequential scan NN queries over datasets with hundreds of dimensions. Furthermore, similarity searches with QED show linear or better scalability in relation to the number of dimensions, and the number of compute nodes.
相似性概念被用作许多数据探索和数据挖掘任务的基础。最近邻(NN)查询可识别最相似的项目,或者就距离而言,找到距离查询点最近的点。传统上,相似性是通过多维特征向量之间的距离函数来表征的。然而,当数据是高维数据时,传统距离函数无法显著区分最近点和最远点,因为很少有不同的维度主导距离函数。局部相似性函数,即仅考虑与查询接近的维度的函数,独立地对每个维度进行量化,并且仅对查询和点落入同一箱的维度计算相似性。这些量化与查询无关,并且当使用依赖于查询的量化时有可能提高准确性。在这项工作中,我们提出了一种依赖于查询的等深度(QED)实时量化方法,以改进高维相似性搜索。在查询时对每个维度进行量化,并为最接近的一部分点生成局部得分,而对其余点应用恒定惩罚。QED不仅提高了距离度量的质量,还通过过滤掉不相关数据提高了查询时间性能。我们提出了一种分布式索引和查询算法来高效地计算QED。我们的实验结果表明,在具有数百个维度的数据集上,分类准确率以及查询性能都有了提高,比基于曼哈顿距离的顺序扫描NN查询快了一个数量级。此外,使用QED的相似性搜索在维度数量和计算节点数量方面显示出线性或更好的可扩展性。