Frank-Michael Schleif, Peter Tino
University of Birmingham, School of Computer Science, B15 2TT, Birmingham, U.K.
Neural Comput. 2015 Oct;27(10):2039-96. doi: 10.1162/NECO_a_00770. Epub 2015 Aug 27.
Efficient learning of a data analysis task strongly depends on the data representation. Most methods rely on (symmetric) similarity or dissimilarity representations by means of metric inner products or distances, providing easy access to powerful mathematical formalisms like kernel or branch-and-bound approaches. Similarities and dissimilarities are, however, often naturally obtained by nonmetric proximity measures that cannot easily be handled by classical learning algorithms. Major efforts have been undertaken to provide approaches that can either be used directly for such data or make standard methods available for these types of data. We provide a comprehensive survey of the field of learning with nonmetric proximities. First, we introduce the formalism used in nonmetric spaces and motivate specific treatments for nonmetric proximity data. Second, we provide a systematization of the various approaches. For each category of approaches, we provide a comparative discussion of the individual algorithms and address complexity issues and generalization properties. In a summarizing section, we provide a larger experimental study for the majority of the algorithms on standard data sets. We also address the problem of large-scale proximity learning, which is often overlooked in this context but is of major importance for making these methods relevant in practice. The algorithms we discuss are in general applicable to proximity-based clustering, one-class classification, classification, regression, and embedding approaches. In the experimental part, we focus on classification tasks.
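As a minimal illustration of the problem the survey addresses, the sketch below (an assumption for illustration, not an algorithm from the paper) builds a small dissimilarity matrix that violates the triangle inequality, converts it to a similarity matrix by classical double centering, and shows that the result is indefinite, so it is not a valid kernel. It then applies the eigenvalue "clip" correction, one common way of making such data accessible to standard kernel methods:

```python
import numpy as np

# Hypothetical 4-point dissimilarity matrix chosen so that
# d(0,2) = 5 > d(0,1) + d(1,2) = 2, violating the triangle
# inequality: the proximities are nonmetric.
D = np.array([
    [0.0, 1.0, 5.0, 2.0],
    [1.0, 0.0, 1.0, 2.0],
    [5.0, 1.0, 0.0, 2.0],
    [2.0, 2.0, 2.0, 0.0],
])

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
S = -0.5 * J @ (D ** 2) @ J              # double centering: dissimilarity -> similarity

# An indefinite S has negative eigenvalues, so it cannot serve
# as a Mercer kernel for standard kernel machines.
eigvals, eigvecs = np.linalg.eigh(S)
print("eigenvalues:", np.round(eigvals, 3))

# Eigenvalue "clip" correction: set negative eigenvalues to zero,
# projecting S onto the cone of positive semidefinite matrices.
S_clip = eigvecs @ np.diag(np.maximum(eigvals, 0.0)) @ eigvecs.T
print("min eigenvalue after clip:", np.linalg.eigvalsh(S_clip).min())
```

Clipping is only one of several spectrum corrections (others flip or shift the negative eigenvalues); each alters the data differently, which is part of what motivates the systematic comparison in the survey.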