Wu T J, Burke J P, Davison D B
Department of Applied Mathematics, National Donghwa University, Hualien, Taiwan.
Biometrics. 1997 Dec;53(4):1431-9.
A number of algorithms exist for searching genetic databases for biologically significant similarities in DNA sequences. Past research has shown that word-based search tools are computationally efficient and can find similarities or dissimilarities invisible to other algorithms like FASTA. We characterize a family of word-based dissimilarity measures that define distance between two sequences by simultaneously comparing the frequencies of all subsequences of n adjacent letters (i.e., n-words) in the two sequences. Applications to real data demonstrate that currently used word-based methods that rely on Euclidean distance can be significantly improved by using Mahalanobis distance, which accounts for both variances and covariances between frequencies of n-words. Furthermore, in those cases where Mahalanobis distance may be too difficult to compute, using standardized Euclidean distance, which only corrects for the variances of frequencies of n-words, still gives better performance than the Euclidean distance. We also consider a simple way of combining the distances obtained for different word lengths n into a single measure of dissimilarity between two DNA sequences. The performance ranking of the preceding three distances still holds for their combined counterparts. All results obtained in this paper are applicable to amino acid sequences with minor modifications.
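As an illustration of the word-based distances described in the abstract, the following minimal Python sketch computes n-word frequency vectors and compares them with the Euclidean and standardized Euclidean distances. It is not the authors' implementation. The function names, the use of relative frequencies of overlapping n-words, and the estimation of variances from a small set of reference sequences are all assumptions made for this example; the abstract does not say how the variances are obtained. Sequences are assumed to contain only the letters A, C, G and T. The Mahalanobis distance is omitted because it additionally requires the inverse of a covariance matrix of the n-word frequencies.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumed DNA alphabet


def word_frequencies(seq, n):
    """Relative frequencies of all overlapping n-words, in a fixed word order."""
    words = ["".join(p) for p in product(ALPHABET, repeat=n)]
    total = len(seq) - n + 1  # number of overlapping windows of length n
    counts = Counter(seq[i:i + n] for i in range(total))
    return [counts[w] / total for w in words]


def euclidean(f, g):
    """Plain Euclidean distance between two frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))


def standardized_euclidean(f, g, variances):
    """Euclidean distance with each squared difference divided by a variance.

    Words whose variance is zero are skipped (an assumption of this sketch).
    """
    return sqrt(sum((a - b) ** 2 / v
                    for a, b, v in zip(f, g, variances) if v > 0))


def sample_variances(freq_vectors):
    """Per-word sample variances over a set of frequency vectors.

    Hypothetical estimator used only for illustration.
    """
    m = len(freq_vectors)
    columns = list(zip(*freq_vectors))
    means = [sum(col) / m for col in columns]
    return [sum((x - mu) ** 2 for x in col) / (m - 1)
            for col, mu in zip(columns, means)]


if __name__ == "__main__":
    n = 2
    reference = ["ACGTACGTAC", "AACCGGTTAA", "ATATATGCGC", "GGGTTTCCCA"]
    variances = sample_variances([word_frequencies(s, n) for s in reference])
    f = word_frequencies("ACGTTGCAAC", n)
    g = word_frequencies("TTGACCGTAA", n)
    print("Euclidean:", euclidean(f, g))
    print("Standardized Euclidean:", standardized_euclidean(f, g, variances))
```

Dividing each squared difference by a variance down-weights n-words whose frequencies fluctuate strongly from sequence to sequence, which is the correction the abstract attributes to the standardized Euclidean distance.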