Yujian Li, Bo Liu
College of Computer Science and Technology, Beijing University of Technology, Pingleyuan 100, Chaoyang District, Beijing 100022, P.R. China.
IEEE Trans Pattern Anal Mach Intell. 2007 Jun;29(6):1091-5. doi: 10.1109/TPAMI.2007.1078.
Although a number of normalized edit distances presented so far may offer good performance in some applications, none of them can be regarded as a genuine metric between strings because they do not satisfy the triangle inequality. Given two strings X and Y over a finite alphabet, this paper defines a new normalized edit distance between X and Y as a simple function of their lengths (|X| and |Y|) and the Generalized Levenshtein Distance (GLD) between them. The new distance can be easily computed through GLD with a complexity of O(|X|.|Y|) and it is a metric valued in [0, 1] under the condition that the weight function is a metric over the set of elementary edit operations with all costs of insertions/deletions having the same weight. Experiments using the AESA algorithm in handwritten digit recognition show that the new distance can generally provide similar results to some other normalized edit distances and may perform slightly better if the triangle inequality is violated in a particular data set.
尽管迄今为止提出的一些归一化编辑距离在某些应用中可能表现良好,但它们都不能被视为字符串之间真正的度量标准,因为它们不满足三角不等式。给定有限字母表上的两个字符串X和Y,本文将X和Y之间的新归一化编辑距离定义为它们长度(|X|和|Y|)以及它们之间的广义莱文斯坦距离(GLD)的简单函数。新距离可以通过GLD轻松计算,复杂度为O(|X|.|Y|),并且在权重函数是基本编辑操作集上的度量标准且所有插入/删除成本具有相同权重的条件下,它是一个取值在[0, 1]的度量标准。在手写数字识别中使用AESA算法的实验表明,新距离通常可以提供与其他一些归一化编辑距离相似的结果,并且如果在特定数据集中违反三角不等式,可能会表现得稍好一些。