Hudek Alexander K, Brown Daniel G
School of Computer Science, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada.
BMC Bioinformatics. 2005 Nov 17;6:273. doi: 10.1186/1471-2105-6-273.
Multiple genome alignment is an important problem in bioinformatics. A key subproblem in many multiple alignment approaches is aligning two multiple alignments. Many popular alignment algorithms for DNA use the sum-of-pairs heuristic, where the score of a multiple alignment is the sum of its induced pairwise alignment scores. However, the biological meaning of the sum-of-pairs heuristic is not obvious. Additionally, many algorithms based on the sum-of-pairs heuristic are complicated and slow compared to pairwise alignment algorithms. An alternative approach to aligning alignments is to first infer an ancestral sequence for each alignment, and then align the two ancestral sequences. In addition to being fast, this method has a clear biological basis that takes into account the evolution implied by an underlying phylogenetic tree. In this study we explore the accuracy of aligning alignments by ancestral sequence alignment. We examine the use of both maximum likelihood and parsimony to infer ancestral sequences. Additionally, we investigate how allowing ambiguity in the ancestral sequences affects accuracy.
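To make the sum-of-pairs definition concrete, the following is a minimal sketch that scores a multiple alignment as the sum of its induced pairwise alignment scores. The function names and the match, mismatch, and gap values are illustrative assumptions, not the parameters used in the study, and gaps are charged a simple per-column cost rather than an affine gap open/extend cost.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one column of an induced pairwise alignment.

    Scoring values are illustrative assumptions. A column where both
    rows have a gap does not exist in the induced pairwise alignment,
    so it contributes nothing.
    """
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment, **scores):
    """Sum the induced pairwise alignment scores over all pairs of rows."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have equal length")
    total = 0
    for row1, row2 in combinations(alignment, 2):
        total += sum(pair_score(a, b, **scores) for a, b in zip(row1, row2))
    return total


alignment = ["ACG-T", "ACGAT", "A-GAT"]
print(sum_of_pairs(alignment))  # 3
```

The number of row pairs grows quadratically with the number of sequences, which is one reason sum-of-pairs methods tend to cost more than a single pairwise alignment of two ancestral sequences.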
We use synthetic sequence data that we generate by simulating evolution on a phylogenetic tree. We use two different types of phylogenetic trees: trees with a period of rapid growth followed by a period of slow growth, and trees with a period of slow growth followed by a period of rapid growth. We examine the alignment accuracy of four ancestral sequence reconstruction and alignment methods: parsimony, maximum likelihood, ambiguous parsimony, and ambiguous maximum likelihood. Additionally, we compare against the alignment accuracy of two sum-of-pairs algorithms: ClustalW and the heuristic of Ma, Zhang, and Wang.
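As an illustration of parsimony-based ancestral reconstruction with and without ambiguity, the sketch below applies the bottom-up pass of Fitch's algorithm to each alignment column. It is not the authors' implementation: the tree encoding (nested pairs of leaf names), the function names, and the tie-breaking rule for the non-ambiguous case are all assumptions made for this example.

```python
def fitch_set(tree, column):
    """Bottom-up Fitch pass: return the state set at the root of `tree`.

    `tree` is a leaf name (str) or a pair (left, right) of subtrees;
    `column` maps each leaf name to its character in one alignment column.
    """
    if isinstance(tree, str):
        return {column[tree]}
    left, right = tree
    a, b = fitch_set(left, column), fitch_set(right, column)
    # Intersection if the children agree on some state, union otherwise.
    return (a & b) or (a | b)


def infer_ancestor(tree, alignment, ambiguous=True):
    """Infer a root sequence column by column.

    With ambiguous=True each position keeps the full Fitch state set.
    Otherwise one state is picked; taking the alphabetically smallest
    state is an arbitrary tie-break assumed here for reproducibility.
    """
    length = len(next(iter(alignment.values())))
    ancestor = []
    for i in range(length):
        column = {name: row[i] for name, row in alignment.items()}
        states = fitch_set(tree, column)
        ancestor.append(states if ambiguous else min(states))
    return ancestor


tree = (("s1", "s2"), ("s3", "s4"))
alignment = {"s1": "ACGT", "s2": "ACGA", "s3": "ATGT", "s4": "ATGC"}
print(infer_ancestor(tree, alignment))
print("".join(infer_ancestor(tree, alignment, ambiguous=False)))  # ACGT
```

In this example the second position remains ambiguous between C and T, because the two subtrees disagree; the ambiguous variant keeps both states, while the non-ambiguous variant commits to one before the ancestral sequences are aligned.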
We find that allowing ambiguity in ancestral sequences does not lead to better multiple alignments. Regardless of whether we use parsimony or maximum likelihood, the success of aligning ancestral sequences containing ambiguity is very sensitive to the choice of gap open cost. Surprisingly, we find that using maximum likelihood to infer ancestral sequences results in less accurate alignments than when using parsimony to infer ancestral sequences. Finally, we find that the sum-of-pairs methods produce better alignments than all of the ancestral alignment methods.