Guan X, Uberbacher E C
Computer Science and Mathematics Division, Oak Ridge National Laboratory, TN 37831-6364, USA.
Comput Appl Biosci. 1996 Feb;12(1):31-40. doi: 10.1093/bioinformatics/12.1.31.
Molecular sequences, like all experimental data, are subject to error. Many current DNA sequencing protocols have substantial error rates and often generate artefactual insertions and deletions of bases (indels), which corrupt the translation of sequences and compromise the detection of protein homologies. The impact of these errors on the utility of molecular sequence data depends on the analytic technique used to interpret the data. In the presence of frameshift errors, standard algorithms using six-frame translation can miss important homologies because only subfragments of the correct translation are available in any given frame. We present a new algorithm that detects and corrects frameshift errors in DNA sequences while comparing translated sequences with protein sequences in the databases. This algorithm can recognize homologous proteins sharing 30% identity even in the presence of a 7% frameshift error rate. Our algorithm uses dynamic programming, producing a guaranteed optimal alignment in the presence of frameshifts, and has a sensitivity equivalent to Smith-Waterman. The algorithm runs in O(nm) time, where n and m are the lengths of the two sequences being compared. The algorithm does not rely on prior knowledge or heuristic rules and performs significantly better than any previously reported method.
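The abstract does not state the recurrence, so the following is only a minimal sketch of one generic way to make a Smith-Waterman-style local alignment tolerate frameshifts. It is not the authors' algorithm. The function name, the identity-based scoring, and all penalty values are assumptions chosen for illustration. Each cell takes the best of a normal codon step (3 bases against 1 residue), two frameshift steps (2 bases against 1 residue for a deleted base, or 1 skipped base for an inserted base), and ordinary gaps, so the table is filled in O(nm) time.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def frameshift_local_score(dna, protein, match=5, mismatch=-4,
                           gap=8, frameshift=10):
    """Best local score of DNA against protein, allowing frameshifts.

    H[i][j] is the best score of an alignment ending at DNA prefix
    length i and protein prefix length j. All scoring parameters are
    illustrative assumptions, not values from the paper.
    """
    n, m = len(dna), len(protein)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cands = [0, H[i][j - 1] - gap]  # restart, or residue vs. gap
            if i >= 3:
                aa = CODON_TABLE.get(dna[i - 3:i], "X")
                sub = match if aa == protein[j - 1] else mismatch
                cands.append(H[i - 3][j - 1] + sub)  # in-frame codon
                cands.append(H[i - 3][j] - gap)      # codon vs. gap
            if i >= 2:
                # Deleted base: two bases stand in for one residue.
                cands.append(H[i - 2][j - 1] - frameshift)
            # Inserted base: skip one base without consuming a residue.
            cands.append(H[i - 1][j] - frameshift)
            H[i][j] = max(cands)
            best = max(best, H[i][j])
    return best


# Toy data: a 10-residue protein and a DNA sequence that encodes it.
protein = "MKWVTFISLL"
intact = "ATGAAATGGGTTACTTTTATTTCTCTTCTT"
broken = intact[:13] + intact[14:]  # drop one base from the fifth codon

print(frameshift_local_score(intact, protein))   # 50
print(frameshift_local_score(broken, protein))   # 35
```

On this toy input the intact sequence scores 50 (ten matched codons), and the sequence with one deleted base scores 35 (nine matched codons minus one frameshift penalty). Setting `frameshift` to a very large value effectively disables the frameshift steps, and the score then falls back to the best fragment found in a single reading frame, which is the failure mode of plain six-frame translation that the abstract describes.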