用于生物序列比较的动态规划算法。

Dynamic programming algorithms for biological sequence comparison.

作者信息

Pearson W R, Miller W

机构信息

Department of Biochemistry, University of Virginia, Charlottesville 22908.

出版信息

Methods Enzymol. 1992;210:575-601. doi: 10.1016/0076-6879(92)10029-d.

DOI:10.1016/0076-6879(92)10029-d

PMID:1584052

Abstract

Efficient dynamic programming algorithms are available for a broad class of protein and DNA sequence comparison problems. These algorithms require computer time proportional to the product of the lengths of the two sequences being compared [O(N2)] but require memory space proportional only to the sum of these lengths [O(N)]. Although the requirement for O(N2) time limits use of the algorithms to the largest computers when searching protein and DNA sequence databases, many other applications of these algorithms, such as calculation of distances for evolutionary trees and comparison of a new sequence to a library of sequence profiles, are well within the capabilities of desktop computers. In particular, the results of library searches with rapid searching programs, such as FASTA or BLAST, should be confirmed by performing a rigorous optimal alignment. Whereas rapid methods do not overlook significant sequence similarities, FASTA limits the number of gaps that can be inserted into an alignment, so that a rigorous alignment may extend the alignment substantially in some cases. BLAST does not allow gaps in the local regions that it reports; a calculation that allows gaps is very likely to extend the alignment substantially. Although a Monte Carlo evaluation of the statistical significance of a similarity score with a rigorous algorithm is much slower than the heuristic approach used by the RDF2 program, the dynamic programming approach should take less than 1 hr on a 386-based PC or desktop Unix workstation. For descriptive purposes, we have limited our discussion to methods for calculating similarity scores and distances that use gap penalties of the form g = rk. Nevertheless, programs for the more general case (g = q+rk) are readily available. Versions of these programs that run either on Unix workstations, IBM-PC class computers, or the Macintosh can be obtained from either of the authors.

摘要

对于一大类蛋白质和DNA序列比较问题，已有高效的动态规划算法。这些算法所需的计算机时间与被比较的两条序列长度的乘积成正比[O(N2)]，但所需的内存空间仅与这些长度的总和成正比[O(N)]。尽管O(N2)时间要求限制了在搜索蛋白质和DNA序列数据库时将这些算法应用于最大型的计算机，但这些算法的许多其他应用，如计算进化树的距离以及将新序列与序列图谱库进行比较，都完全在台式计算机的能力范围内。特别是，使用FASTA或BLAST等快速搜索程序进行库搜索的结果，应通过进行严格的最优比对来确认。虽然快速方法不会忽略显著的序列相似性，但FASTA限制了可插入比对中的空位数量，因此在某些情况下，严格的比对可能会大幅扩展比对结果。BLAST在其报告的局部区域不允许有空位；允许有空位的计算很可能会大幅扩展比对结果。尽管使用严格算法对相似性得分的统计显著性进行蒙特卡罗评估比RDF2程序使用的启发式方法要慢得多，但动态规划方法在基于386的PC或台式Unix工作站上运行应耗时不到1小时。为了便于描述，我们将讨论限于使用形式为g = rk的空位罚分来计算相似性得分和距离的方法。然而，适用于更一般情况（g = q + rk）的程序也很容易获得。这些程序运行在Unix工作站、IBM - PC类计算机或Macintosh上的版本可从任何一位作者处获得。