Arslan A N, Eğecioğlu O, Pevzner P A
Department of Computer Science, University of California, Santa Barbara, Santa Barbara, CA 93106, USA Department of Computer Science and Engineering, University of California, San Diego, San Diego, CA 92093, USA.
Bioinformatics. 2001 Apr;17(4):327-37. doi: 10.1093/bioinformatics/17.4.327.
The Smith-Waterman algorithm for local sequence alignment is one of the most important techniques in computational molecular biology. This ingenious dynamic programming approach was designed to reveal the highly conserved fragments by discarding poorly conserved initial and terminal segments. However, the existing notion of local similarity has a serious flaw: it does not discard poorly conserved intermediate segments. The Smith-Waterman algorithm finds the local alignment with maximal score but it is unable to find local alignment with maximum degree of similarity (e.g. maximal percent of matches). Moreover, there is still no efficient algorithm that answers the following natural question: do two sequences share a (sufficiently long) fragment with more than 70% of similarity? As a result, the local alignment sometimes produces a mosaic of well-conserved fragments artificially connected by poorly-conserved or even unrelated fragments. This may lead to problems in comparison of long genomic sequences and comparative gene prediction as recently pointed out by Zhang et al. (Bioinformatics, 15, 1012-1019, 1999). In this paper we propose a new sequence comparison algorithm (normalized local alignment ) that reports the regions with maximum degree of similarity. The algorithm is based on fractional programming and its running time is O(n2log n). In practice, normalized local alignment is only 3-5 times slower than the standard Smith-Waterman algorithm.
用于局部序列比对的史密斯-沃特曼算法是计算分子生物学中最重要的技术之一。这种巧妙的动态规划方法旨在通过舍弃保守性较差的起始和末端片段来揭示高度保守的片段。然而,现有的局部相似性概念存在一个严重缺陷:它不会舍弃保守性较差的中间片段。史密斯-沃特曼算法能找到得分最高的局部比对,但无法找到相似度最高的局部比对(例如匹配百分比最高)。此外,仍然没有一种有效的算法能回答以下自然问题:两个序列是否共享一个相似度超过70%的(足够长的)片段?因此,局部比对有时会产生由保守性较差甚至不相关的片段人为连接起来的高度保守片段的拼接图。正如张等人最近指出的(《生物信息学》,第15卷,第1012 - 1019页,1999年),这可能会在长基因组序列比较和比较基因预测中导致问题。在本文中,我们提出了一种新的序列比较算法(归一化局部比对),该算法能报告相似度最高的区域。该算法基于分式规划,其运行时间为O(n2log n)。在实际应用中,归一化局部比对仅比标准的史密斯-沃特曼算法慢3至5倍。