Feldbauer Roman, Flexer Arthur
Austrian Research Institute for Artificial Intelligence, Freyung 6/6/7, 1010 Vienna, Austria.
Knowl Inf Syst. 2019;59(1):137-166. doi: 10.1007/s10115-018-1205-y. Epub 2018 May 18.
Hubness is an aspect of the curse of dimensionality related to the distance concentration effect. Hubs occur in high-dimensional data spaces as objects that are particularly often among the nearest neighbors of other objects. Conversely, other data objects become antihubs, which are rarely or never nearest neighbors to other objects. Many machine learning algorithms rely on nearest neighbor search and some form of measuring distances, which are both impaired by high hubness. Degraded performance due to hubness has been reported for various tasks such as classification, clustering, regression, visualization, recommendation, retrieval and outlier detection. Several hubness reduction methods based on different paradigms have previously been developed. Local and global scaling as well as shared neighbors approaches aim at repairing asymmetric neighborhood relations. Global and localized centering try to eliminate spatial centrality, while the related global and local dissimilarity measures are based on density gradient flattening. Additional methods and alternative dissimilarity measures that were argued to mitigate detrimental effects of distance concentration also influence the related hubness phenomenon. In this paper, we present a large-scale empirical evaluation of all available unsupervised hubness reduction methods and dissimilarity measures. We investigate several aspects of hubness reduction as well as its influence on data semantics, which we measure via nearest neighbor classification. Scaling and density gradient flattening methods improve evaluation measures such as hubness and classification accuracy consistently for data sets from a wide range of domains, while centering approaches achieve the same only under specific settings.
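To make the quantities in the abstract concrete, the sketch below shows how hubness is commonly quantified (skewness of the k-occurrence distribution) and how one density-gradient-flattening secondary measure, empirical Mutual Proximity, rescales a distance matrix. This is a minimal illustrative implementation under our own assumptions, not the authors' code; function names and parameters here are invented for the example, and tested implementations of these and related methods are available in the first author's scikit-hubness Python package.

```python
# Hedged sketch: measure hubness via k-occurrence skewness, then apply
# empirical Mutual Proximity as one example of hubness reduction.
# All names and defaults are illustrative, not taken from the paper.
import numpy as np
from scipy.stats import skew
from scipy.spatial.distance import cdist


def k_occurrence_skewness(dist, k=10):
    """Skewness of the k-occurrence distribution (a common hubness measure).

    dist : (n, n) pairwise distance matrix; larger skewness = stronger hubness.
    """
    n = dist.shape[0]
    d = dist.copy()
    np.fill_diagonal(d, np.inf)          # exclude self-neighbors
    knn = np.argsort(d, axis=1)[:, :k]   # each object's k nearest neighbors
    k_occurrence = np.bincount(knn.ravel(), minlength=n)
    return skew(k_occurrence)


def mutual_proximity_empirical(dist):
    """Empirical Mutual Proximity: the fraction of other objects that are
    farther from both x and y than x and y are from each other, turned into
    a secondary distance (a density-gradient-flattening style measure)."""
    n = dist.shape[0]
    mp = np.empty_like(dist, dtype=float)
    for i in range(n):
        for j in range(n):
            d_ij = dist[i, j]
            farther = (dist[i] > d_ij) & (dist[j] > d_ij)
            mp[i, j] = 1.0 - farther.sum() / (n - 2)   # similarity -> distance
    np.fill_diagonal(mp, 0.0)
    return mp


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))          # high-dimensional toy data
D = cdist(X, X)
print("hubness before:", k_occurrence_skewness(D))
print("hubness after MP:", k_occurrence_skewness(mutual_proximity_empirical(D)))
```

On data like the toy example above, the skewness after Mutual Proximity rescaling is typically much closer to zero, which is the kind of effect the evaluation in the paper quantifies across many data sets and methods.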