Zhang Zefeng, Lin Hao, Li Ming
Computational Biology Research Group, Division of Intelligent Software Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China.
Comput Syst Bioinformatics Conf. 2007;6:237-47.
Multiple sequence alignment is a classical and challenging task for biological sequence analysis. The problem is NP-hard. The full dynamic programming takes too much time. The progressive alignment heuristics adopted by most state of the art multiple sequence alignment programs suffer from the 'once a gap, always a gap' phenomenon. Is there a radically new way to do multiple sequence alignment? This paper introduces a novel and orthogonal multiple sequence alignment method, using multiple optimized spaced seeds and new algorithms to handle these seeds efficiently. Our new algorithm processes information of all sequences as a whole, avoiding problems caused by the popular progressive approaches. Because the optimized spaced seeds are provably significantly more sensitive than the consecutive k-mers, the new approach promises to be more accurate and reliable. To validate our new approach, we have implemented MANGO: Multiple Alignment with N Gapped Oligos. Experiments were carried out on large 16S RNA benchmarks showing that MANGO compares favorably, in both accuracy and speed, against state-of-art multiple sequence alignment methods, including ClustalW 1.83, MUSCLE 3.6, MAFFT 5.861, Prob-ConsRNA 1.11, Dialign 2.2.1, DIALIGN-T 0.2.1, T-Coffee 4.85, POA 2.0 and Kalign 2.0.
多序列比对是生物序列分析中的一项经典且具有挑战性的任务。该问题是NP难问题。完整的动态规划方法耗时过长。大多数先进的多序列比对程序所采用的渐进比对启发式算法存在“一旦有缺口,永远有缺口”的现象。是否存在一种全新的多序列比对方法呢?本文介绍了一种新颖且正交的多序列比对方法,该方法使用多个优化的间隔种子以及新算法来高效处理这些种子。我们的新算法将所有序列的信息作为一个整体进行处理,避免了流行的渐进方法所导致的问题。由于优化的间隔种子经证明比连续的k-mer序列明显更具敏感性,因此新方法有望更加准确和可靠。为了验证我们的新方法,我们实现了MANGO:带N个缺口寡核苷酸的多序列比对工具。在大型16S RNA基准数据集上进行的实验表明,MANGO在准确性和速度方面均优于包括ClustalW 1.83、MUSCLE 3.6、MAFFT 5.861、Prob-ConsRNA 1.11、Dialign 2.2.1、DIALIGN-T 0.2.1、T-Coffee 4.85、POA 2.0和Kalign 2.0在内的现有多序列比对方法。