A local alignment metric for accelerating biosequence database search.

Author information

Spiro Peter A, Macura Natasa

Affiliation

Incyte Genomics, Inc., 3160 Porter Drive, Palo Alto, CA 94304, USA.

Publication information

J Comput Biol. 2004;11(1):61-82. doi: 10.1089/106652704773416894.

DOI:10.1089/106652704773416894

PMID:15072689

Abstract

We introduce a metric for local sequence alignments that has utility for accelerating optimal alignment searches without loss of sensitivity. The metric's triangle inequality property permits identification of redundant database entries guaranteed to have optimal alignments to the query sequence that fall below a specified score threshold, thereby permitting comparisons to these entries to be skipped. We prove the existence of the metric for a variety of scoring systems, including the most commonly used ones, and show that a triangle inequality can be established as well for nucleotide-to-protein sequence comparisons. We discuss a database clustering and search strategy that takes advantage of the triangle inequality. The strategy permits moderate but significant acceleration of searches against the widely used "nr" protein database. It also provides a theoretically based method for database clustering in general and provides a standard against which to compare heuristic clustering strategies.
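The pruning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not define the alignment metric, so Hamming distance on equal-length strings stands in as a toy metric, and the search uses a distance cutoff (`max_dist`) in place of an alignment score threshold. The function names `build_clusters` and `search`, the greedy clustering rule, and the sample sequences are all assumptions made for illustration. The only property relied on is the triangle inequality, which gives the lower bound d(q, m) >= |d(q, r) - d(r, m)| for a query q, a cluster representative r, and a cluster member m.

```python
from typing import Dict, List, Tuple


def hamming(a: str, b: str) -> int:
    """Toy metric (an assumption, not the paper's metric): mismatch count."""
    if len(a) != len(b):
        raise ValueError("hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def build_clusters(db: List[str], radius: int) -> Dict[str, List[Tuple[str, int]]]:
    """Greedy clustering: an entry joins the first representative within
    `radius`; otherwise it becomes a new representative.
    Maps representative -> [(member, d(representative, member)), ...]."""
    clusters: Dict[str, List[Tuple[str, int]]] = {}
    for seq in db:
        for rep, members in clusters.items():
            d = hamming(rep, seq)
            if d <= radius:
                members.append((seq, d))  # store the precomputed distance
                break
        else:
            clusters[seq] = []  # no representative close enough
    return clusters


def search(query: str, clusters: Dict[str, List[Tuple[str, int]]],
           max_dist: int) -> Tuple[List[str], int]:
    """Return (entries within max_dist of query, number of skipped comparisons)."""
    hits: List[str] = []
    skipped = 0
    for rep, members in clusters.items():
        d_qr = hamming(query, rep)
        if d_qr <= max_dist:
            hits.append(rep)
        for member, d_rm in members:
            # Triangle inequality: d(query, member) >= |d_qr - d_rm|.
            # If that bound already exceeds the cutoff, the member cannot
            # be a hit, so the comparison is skipped without losing hits.
            if abs(d_qr - d_rm) > max_dist:
                skipped += 1
                continue
            if hamming(query, member) <= max_dist:
                hits.append(member)
    return hits, skipped


db = ["AAAA", "AAAT", "AATT", "GGGG", "GGGC", "CCCC"]
clusters = build_clusters(db, radius=1)
hits, skipped = search("AAAA", clusters, max_dist=1)
print(hits, skipped)
```

Because the skip test uses a guaranteed lower bound, the result matches an exhaustive scan. The saving depends on how tightly the database clusters, which fits the abstract's description of the acceleration as moderate.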
