Subramanian Amarendran R, Kaufmann Michael, Morgenstern Burkhard
University of Tübingen, Wilhelm-Schickard-Institut für Informatik, Sand 13, 72076 Tübingen, Germany.
Algorithms Mol Biol. 2008 May 27;3:6. doi: 10.1186/1748-7188-3-6.
DIALIGN-T is a reimplementation of the multiple-alignment program DIALIGN. Due to several algorithmic improvements, it produces significantly better alignments on locally and globally related sequence sets than previous versions of DIALIGN. However, like the original implementation of the program, DIALIGN-T uses a straightforward greedy approach to assemble multiple alignments from local pairwise sequence similarities. Such greedy approaches may be vulnerable to spurious random similarities and can therefore lead to suboptimal results. In this paper, we present DIALIGN-TX, a substantial improvement of DIALIGN-T that combines our previous greedy algorithm with a progressive alignment approach.
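To illustrate why a purely greedy assembly can be misled by a spurious similarity, the following is a minimal sketch and not the DIALIGN-T implementation. It assumes a hypothetical `Fragment` record (a gap-free local pairwise similarity with a weight) and a deliberately simplified consistency test that only compares fragments on the same sequence pair; the real program must also handle conflicts that arise transitively across more than two sequences.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """Hypothetical gap-free local similarity between two sequences."""
    seq_a: int      # index of the first sequence
    start_a: int    # start position in the first sequence
    seq_b: int      # index of the second sequence
    start_b: int    # start position in the second sequence
    length: int     # number of aligned positions
    weight: float   # similarity score; higher is better


def conflicts(f: Fragment, g: Fragment) -> bool:
    """Return True if f and g cannot both appear in one alignment.

    Simplifying assumption: only fragments on the same sequence pair are
    compared, so transitive inconsistencies are ignored.
    """
    if (f.seq_a, f.seq_b) != (g.seq_a, g.seq_b):
        return False
    first, second = sorted((f, g), key=lambda x: x.start_a)
    # Compatible only if `second` lies strictly after `first` in both sequences.
    ordered_in_a = first.start_a + first.length <= second.start_a
    ordered_in_b = first.start_b + first.length <= second.start_b
    return not (ordered_in_a and ordered_in_b)


def greedy_select(fragments: List[Fragment]) -> List[Fragment]:
    """Accept fragments in order of decreasing weight if they stay consistent."""
    accepted: List[Fragment] = []
    for frag in sorted(fragments, key=lambda f: f.weight, reverse=True):
        if not any(conflicts(frag, other) for other in accepted):
            accepted.append(frag)
    return accepted
```

In this sketch, once a high-weight fragment is accepted it is never revisited, so a single spurious high-scoring similarity can exclude several weaker but mutually consistent fragments.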
Our new heuristic produces significantly better alignments, especially on globally related sequences, without excessively increasing CPU time and memory consumption. The new method is based on a guide tree; to detect possible spurious sequence similarities, it employs a vertex-cover approximation on a conflict graph. We performed benchmarking tests on a large set of nucleic acid and protein sequences. For protein benchmarks, we used the benchmark database BALIBASE 3 and an updated release of the database IRMBASE 2 to assess alignment quality on globally and locally related sequences, respectively. For alignment of nucleic acid sequences, we used BRAliBase II for global alignment and a newly developed database of locally related sequences called DIRMBASE 1. IRMBASE 2 and DIRMBASE 1 are constructed by implanting highly conserved motifs at random positions in long unalignable sequences.
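The abstract does not specify which vertex-cover approximation is used or how the conflict graph is weighted. As a hedged illustration only, the sketch below applies the textbook 2-approximation for unweighted vertex cover, which takes both endpoints of every edge that is not yet covered. The function name `approx_vertex_cover` and the fragment labels are hypothetical; vertices stand for candidate similarities and edges join pairs that cannot coexist in one alignment.

```python
from typing import Hashable, Iterable, Set, Tuple


def approx_vertex_cover(edges: Iterable[Tuple[Hashable, Hashable]]) -> Set[Hashable]:
    """Classic 2-approximation for unweighted vertex cover.

    Scans the edges once; whenever an edge has neither endpoint in the
    cover, both endpoints are added. The chosen edges form a maximal
    matching, so the cover is at most twice the size of an optimal one.
    """
    cover: Set[Hashable] = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.add(u)
            cover.add(v)
    return cover


# Hypothetical conflict graph: f1, f2 and f3 are mutually inconsistent,
# and f4 conflicts with f5.
conflict_edges = [("f1", "f2"), ("f1", "f3"), ("f2", "f3"), ("f4", "f5")]
suspect = approx_vertex_cover(conflict_edges)
```

Under this reading, the vertices in the cover are the candidates flagged as possibly spurious, and removing them leaves a conflict-free set. A weighted variant would be needed to prefer discarding low-scoring similarities, which this sketch does not attempt.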
On BALIBASE 3, our new program performs significantly better than the previous program DIALIGN-T and outperforms the popular global aligner CLUSTAL W, though it is still outperformed by programs that focus on global alignment, such as MAFFT, MUSCLE and T-COFFEE. On the locally related test sets in IRMBASE 2 and DIRMBASE 1, our method outperforms all other programs, and MAFFT E-INSi is the only method that comes close to the performance of DIALIGN-TX.