
Nearest neighbors by neighborhood counting.

Author information

Wang Hui

Affiliation

School of Computing and Mathematics, Faculty of Engineering, University of Ulster at Jordanstown, BT37 0QB, Northern Ireland, UK.

Publication information

IEEE Trans Pattern Anal Mach Intell. 2006 Jun;28(6):942-53. doi: 10.1109/TPAMI.2006.126.

Abstract

Finding nearest neighbors is a general idea that underlies many artificial intelligence tasks, including machine learning, data mining, natural language understanding, and information retrieval. This idea is explicitly used in the k-nearest neighbors algorithm (kNN), a popular classification method. In this paper, this idea is adopted in the development of a general methodology, neighborhood counting, for devising similarity functions. We turn our focus from neighbors to neighborhoods, a region in the data space covering the data point in question. To measure the similarity between two data points, we consider all neighborhoods that cover both data points. We propose to use the number of such neighborhoods as a measure of similarity. Neighborhood can be defined for different types of data in different ways. Here, we consider one definition of neighborhood for multivariate data and derive a formula for such similarity, called neighborhood counting measure or NCM. NCM was tested experimentally in the framework of kNN. Experiments show that NCM is generally comparable to VDM and its variants, the state-of-the-art distance functions for multivariate data, and, at the same time, is consistently better for relatively large k values. Additionally, NCM consistently outperforms HEOM (a mixture of Euclidean and Hamming distances), the "standard" and most widely used distance function for multivariate data. NCM has a computational complexity in the same order as the standard Euclidean distance function and NCM is task independent and works for numerical and categorical data in a conceptually uniform way. The neighborhood counting methodology is proven sound for multivariate data experimentally. We hope it will work for other types of data.
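The core idea is countable in closed form: for each attribute, the number of intervals (ordinal attributes) or value subsets (categorical attributes) that cover both points can be computed directly, and the per-attribute counts multiply. The sketch below illustrates that counting idea for mixed ordinal/categorical data; the function name, the `domains` encoding, and the treatment of numeric attributes as integer-valued ordinals are illustrative assumptions, not the paper's exact derivation.

```python
from math import prod

def ncm_similarity(x, y, domains):
    """Neighborhood-counting similarity (illustrative sketch).

    A neighborhood is a product, over attributes, of an interval of
    the attribute's domain (ordinal) or a subset of it (categorical).
    The similarity of x and y is the number of neighborhoods covering
    both points.

    domains: one spec per attribute, either ('ord', lo, hi) for an
    integer-valued ordinal attribute or ('cat', n) for a categorical
    attribute with n possible values (both encodings are assumptions
    made for this sketch).
    """
    counts = []
    for xi, yi, dom in zip(x, y, domains):
        if dom[0] == 'ord':
            _, lo, hi = dom
            # Intervals [a, b] with lo <= a <= min(xi, yi) and
            # max(xi, yi) <= b <= hi cover both values.
            counts.append((min(xi, yi) - lo + 1) * (hi - max(xi, yi) + 1))
        else:
            _, n = dom
            # Subsets of an n-value domain containing one fixed value:
            # 2^(n-1); containing two distinct fixed values: 2^(n-2).
            counts.append(2 ** (n - 1) if xi == yi else 2 ** (n - 2))
    return prod(counts)
```

Each attribute contributes a constant-time count, so the whole similarity is linear in the number of attributes, consistent with the abstract's claim that NCM's complexity is of the same order as Euclidean distance.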

