Effects of Distance Measure Choice on K-Nearest Neighbor Classifier Performance: A Review.

Affiliations

Department of Computer Science, Faculty of Information Technology, Mutah University, Karak, Jordan.

Department of Algorithms and Their Applications, Eötvös Loránd University, Budapest, Hungary.

Publication information

Big Data. 2019 Dec;7(4):221-248. doi: 10.1089/big.2018.0175. Epub 2019 Aug 14.

DOI:10.1089/big.2018.0175

PMID:31411491

Abstract

The K-nearest neighbor (KNN) classifier is one of the simplest and most common classifiers, yet its performance competes with the most complex classifiers in the literature. The core of this classifier depends mainly on measuring the distance or similarity between the tested examples and the training examples. This raises a major question: among the large number of distance and similarity measures available, which should be used for the KNN classifier? This review attempts to answer this question by evaluating the performance (measured by accuracy, precision, and recall) of KNN using a large number of distance measures, tested on a number of real-world data sets, with and without different levels of added noise. The experimental results show that the performance of the KNN classifier depends significantly on the distance used, with large gaps between the performances of different distances. We found that a recently proposed nonconvex distance performed best on most data sets compared with the other tested distances. In addition, the performance of KNN with this top-performing distance degraded by only about 20% when the noise level reached 90%; this is true for most of the other distances as well. This means that the KNN classifier using any of the top 10 distances tolerates noise to a certain degree. Moreover, the results show that some distances are less affected by the added noise than others.
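
To make the abstract's central point concrete, the following is a minimal sketch of a KNN classifier whose distance function is a swappable parameter. It is not the authors' experimental code. The three distances shown (Euclidean, Manhattan, Chebyshev) are common textbook choices picked for illustration, not the measures the review ranks highest, and the function names and the two-point toy data set are assumptions made for this example.

```python
import math
from collections import Counter

# Three standard distance measures (illustrative choices, not the review's ranking).
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k, distance):
    """Predict the label of `query` by majority vote among its k nearest
    training examples, where nearness is defined by `distance`."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two training points with different labels.
train_X = [(3.0, 3.0), (0.0, 4.0)]
train_y = ["a", "b"]
query = (0.0, 0.0)

for name, dist in [("euclidean", euclidean),
                   ("manhattan", manhattan),
                   ("chebyshev", chebyshev)]:
    print(name, knn_predict(train_X, train_y, query, k=1, distance=dist))
```

With this toy data, the Euclidean and Manhattan distances place (0, 4) closer to the query, while the Chebyshev distance places (3, 3) closer, so the predicted label changes with the distance measure alone. This is the kind of sensitivity the review measures at scale on real-world data sets.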

