Lathrop R H
Department of Information and Computer Science, University of California, Irvine 92697, USA.
J Comput Biol. 1999 Fall-Winter;6(3-4):405-18. doi: 10.1089/106652799318355.
This paper describes a novel anytime branch-and-bound or best-first threading search algorithm for gapped block protein sequence-structure alignment with general sequence residue pair interactions. The new algorithm (1) returns a good approximate answer quickly, (2) iteratively improves that answer to the global optimum if allowed more time, (3) eventually produces a proof that the final answer found is indeed the global optimum, and (4) always terminates correctly within a bounded number of steps if allowed sufficient space and time. It runs in polynomial space, which is asymptotically dominated by the theta(m2ñ2) space required by the lower bound computation. Using previously published data sets and the Bryant-Lawrence (1993) objective function, the algorithm found the true (proven) global optimum in less than 5 min in all search spaces size 10(25) or smaller (sequences to 478 residues), and a putative (not guaranteed) optimum in less than 5 hr in all search spaces size 10(60) or smaller (sequences to 793 residues, cores to 42 secondary structure segments). The threading in the largest case studied was eventually proven to be globally optimal; the corresponding search speed in that case was the equivalent of 1.5 x 10(56) threadings/sec, a speed-up exceeding 10(25) over previously published batch branch-and-bound speeds, and exceeding 10(50) over previously published exhaustive search speeds, using the same objective function and threading paradigm. Implementation-independent measures of search efficiency are defined for equivalent branching factor, depth, and probability of success per draw; empirical data on these measures are given. The general approach should apply to other alignment methodologies and search methods that use a divide-and-conquer strategy.
本文描述了一种新颖的随时可用的分支定界法或最佳优先穿线搜索算法,用于带空位的蛋白质序列-结构比对,并考虑一般序列残基对相互作用。新算法具有以下特点:(1)能快速给出一个较好的近似答案;(2)如果有更多时间,可迭代地将该答案改进为全局最优解;(3)最终能给出证明,表明找到的最终答案确实是全局最优解;(4)如果有足够的空间和时间,总能在有限步数内正确终止。它在多项式空间中运行,其渐近复杂度由下界计算所需的theta(m2ñ2)空间主导。使用先前发表的数据集和Bryant-Lawrence(1993)目标函数,该算法在所有大小为10(25)或更小的搜索空间(序列长度达478个残基)中,不到5分钟就能找到真正(已证明)的全局最优解;在所有大小为10(60)或更小的搜索空间(序列长度达793个残基,核心区域达42个二级结构片段)中,不到5小时就能找到一个假定(未保证)的最优解。在所研究的最大案例中,穿线最终被证明是全局最优的;在该案例中,相应的搜索速度相当于每秒1.5 x 10(56)次穿线,与先前发表的批量分支定界速度相比,加速超过10(25),与先前发表的穷举搜索速度相比,加速超过10(50),且使用相同的目标函数和穿线范式。针对等效分支因子、深度和每次抽取的成功概率,定义了与实现无关的搜索效率度量;并给出了这些度量的经验数据。该通用方法应适用于其他使用分治法策略的比对方法和搜索方法。