Gusfield D
Computer Science Division, University of California, Davis 95616-8755.
Bull Math Biol. 1993 Jan;55(1):141-54. doi: 10.1007/BF02460299.
Multiple string (sequence) alignment is a difficult and important problem in computational biology, where it is central in two related tasks: finding highly conserved subregions or embedded patterns of a set of biological sequences (strings of DNA, RNA or amino acids), and inferring the evolutionary history of a set of taxa from their associated biological sequences. Several precise measures have been proposed for evaluating the goodness of a multiple alignment, but no efficient methods are known which compute the optimal alignment for any of these measures in any but small cases. In this paper, we consider two previously proposed measures, and give two computationaly efficient multiple alignment methods (one for each measure) whose deviation from the optimal value is guaranteed to be less than a factor of two. This is the novel feature of thse methods. but the methods have additional virtues as well. For both methods, the guaranteed bounds are much smaller than two when the number of strings is small (1.33 for three strings of any length); for one of the methods we give a related randomized method which is much faster and which gives, with high probability, multiple alignments with fairly small error bounds; and for the other measure, the method given yields a non-obvious lower bound on the optimal alignment.
多重序列比对是计算生物学中一个困难而重要的问题,它在两个相关任务中处于核心地位:找到一组生物序列(DNA、RNA或氨基酸序列)中高度保守的子区域或嵌入模式,以及从一组生物序列推断一组分类单元的进化历史。已经提出了几种精确的度量来评估多重比对的优劣,但除了小样本情况外,尚无有效的方法能够计算出这些度量中任何一种的最优比对。在本文中,我们考虑了两种先前提出的度量,并给出了两种计算效率高的多重比对方法(每种度量一种),其与最优值的偏差保证小于2倍。这是这些方法的新颖之处。但这些方法还有其他优点。对于这两种方法,当序列数量较少时(任意长度的三个序列时为1.33),保证的界限远小于2;对于其中一种方法,我们给出了一种相关的随机方法,该方法速度快得多,并且以高概率给出误差界限相当小的多重比对;对于另一种度量,所给出的方法给出了最优比对的一个非平凡下界。