Berman P, Zhang Z, Wolf Y I, Koonin E V, Miller W
Department of Computer Science and Engineering, The Pennsylvania State University, University Park 16802, USA.
J Comput Biol. 2000 Feb-Apr;7(1-2):293-302. doi: 10.1089/10665270050081531.
In database searches for sequence similarity, matches to a distinct sequence region (e.g., protein domain) are frequently obscured by numerous matches to another region of the same sequence. In order to cope with this problem, algorithms are developed to discard redundant matches. One model for this problem begins with a list of intervals, each with an associated score; each interval gives the range of positions in the query sequence that align to a database sequence, and the score is that of the alignment. If interval I is contained in interval J, and I's score is less than J's, then I is said to be dominated by J. The problem is then to identify each interval that is dominated by at least K other intervals, where K is a given level of "tolerable redundancy." An algorithm is developed to solve the problem in O(N log N) time and O(N*) space, where N is the number of intervals and N* is a precisely defined value that never exceeds N and is frequently much smaller. This criterion for discarding database hits has been implemented in the Blast program, as illustrated herein with examples. Several variations and extensions of this approach are also described.
在数据库中搜索序列相似性时,与一个独特序列区域(如蛋白质结构域)的匹配常常会被与同一序列另一个区域的大量匹配所掩盖。为了解决这个问题,人们开发了一些算法来丢弃冗余匹配。针对这个问题的一种模型从一个区间列表开始,每个区间都有一个相关的分数;每个区间给出查询序列中与数据库序列对齐的位置范围,分数就是该比对的分数。如果区间I包含在区间J中,且I的分数小于J的分数,那么就说I被J所支配。问题就在于识别出每个被至少K个其他区间所支配的区间,其中K是给定的“可容忍冗余”水平。开发了一种算法,该算法能在O(N log N)时间和O(N*)空间内解决这个问题,其中N是区间的数量,N*是一个精确定义的值,它从不超过N,而且通常要小得多。如本文中的示例所示,这种丢弃数据库命中结果的标准已在Blast程序中实现。还描述了这种方法的几种变体和扩展。