Lam Fumei, Alexandersson Marina, Pachter Lior
Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
J Comput Biol. 2003;10(3-4):509-20. doi: 10.1089/10665270360688156.
The application of Needleman-Wunsch alignment techniques to biological sequences is complicated by two serious problems when the sequences are long: the running time, which scales as the product of the lengths of sequences, and the difficulty in obtaining suitable parameters that produce meaningful alignments. The running time problem is often corrected by reducing the search space, using techniques such as banding, or chaining of high-scoring pairs. The parameter problem is more difficult to fix, partly because the probabilistic model, which Needleman-Wunsch is equivalent to, does not capture a key feature of biological sequence alignments, namely the alternation of conserved blocks and seemingly unrelated nonconserved segments. We present a solution to the problem of designing efficient search spaces for pair hidden Markov models that align biological sequences by taking advantage of their associated features. Our approach leads to an optimization problem, for which we obtain a 2-approximation algorithm, and that is based on the construction of Manhattan networks, which are close relatives of Steiner trees. We describe the underlying theory and show how our methods can be applied to alignment of DNA sequences in practice, successfully reducing the Viterbi algorithm search space of alignment PHMMs by three orders of magnitude.
当序列较长时,将Needleman-Wunsch比对技术应用于生物序列会受到两个严重问题的困扰:运行时间,它与序列长度的乘积成正比;以及难以获得能产生有意义比对的合适参数。运行时间问题通常通过减少搜索空间来解决,例如使用带状比对或高分对链接等技术。参数问题则更难解决,部分原因是与Needleman-Wunsch等价的概率模型没有捕捉到生物序列比对的一个关键特征,即保守块和看似不相关的非保守片段的交替。我们提出了一种解决方案,通过利用生物序列的相关特征为成对隐马尔可夫模型设计高效的搜索空间。我们的方法导致了一个优化问题,对于这个问题我们得到了一个2近似算法,该算法基于曼哈顿网络的构建,曼哈顿网络是斯坦纳树的近亲。我们描述了其基础理论,并展示了我们的方法在实际中如何应用于DNA序列比对,成功地将比对PHMMs的维特比算法搜索空间减少了三个数量级。