Berger Bonnie, Waterman Michael S, Yu Yun William
Department of Mathematics and Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 USA, and also with the Department of Computer Science and AI Lab, Massachusetts Institute of Technology, Cambridge, MA 02139 USA.
Quantitative and Computational Biology Section, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089 USA.
IEEE Trans Inf Theory. 2021 Jun;67(6):3287-3294. doi: 10.1109/tit.2020.2996543. Epub 2020 May 21.
Levenshtein edit distance has played a central role-both past and present-in sequence alignment in particular and biological database similarity search in general. We start our review with a history of dynamic programming algorithms for computing Levenshtein distance and sequence alignments. Following, we describe how those algorithms led to heuristics employed in the most widely used software in bioinformatics, BLAST, a program to search DNA and protein databases for evolutionarily relevant similarities. More recently, the advent of modern genomic sequencing and the volume of data it generates has resulted in a return to the problem of local alignment. We conclude with how the mathematical formulation of Levenshtein distance as a metric made possible additional optimizations to similarity search in biological contexts. These modern optimizations are built around the low metric entropy and fractional dimensionality of biological databases, enabling orders of magnitude acceleration of biological similarity search.
莱文斯坦编辑距离在过去和现在都在序列比对(特别是)以及一般的生物数据库相似性搜索中发挥了核心作用。我们先回顾一下用于计算莱文斯坦距离和序列比对的动态规划算法的历史。接下来,我们描述这些算法如何演变成生物信息学中最广泛使用的软件BLAST(一个在DNA和蛋白质数据库中搜索进化相关相似性的程序)所采用的启发式方法。最近,现代基因组测序的出现及其产生的数据量导致了对局部比对问题的回归。我们最后阐述将莱文斯坦距离作为一种度量的数学公式如何使得在生物背景下对相似性搜索进行额外优化成为可能。这些现代优化是围绕生物数据库的低度量熵和分数维构建的,从而能够将生物相似性搜索加速几个数量级。