Rosenberg Michael S
Center for Evolutionary Functional Genomics, The Biodesign Institute, and the School of Life Sciences, Arizona State University, Tempe, AZ 85287-4501, USA.
BMC Bioinformatics. 2005 Nov 23;6:278. doi: 10.1186/1471-2105-6-278.
Sequence alignment is a common tool in bioinformatics and comparative genomics. It is generally assumed that multiple sequence alignment yields better results than pairwise sequence alignment, but this assumption has rarely been tested, and never with the control provided by simulation analysis. This study used sequence simulation to examine the gain in accuracy from adding a third sequence to a pairwise alignment, concentrating on how the phylogenetic position of the additional sequence relative to the first pair changes the accuracy of the initial pair's alignment as well as their estimated evolutionary distance.
The maximal gain in alignment accuracy was found not when the third sequence was directly intermediate between the initial two sequences, but rather when it perfectly subdivided the branch leading from the root of the tree to one of the original sequences (making it half as close to one sequence as the other). Evolutionary distance estimation in the multiple alignment framework, however, was largely unrelated to alignment accuracy and depended instead on the position of the third sequence: the closer the branch leading to the third sequence was to the root of the tree, the larger the estimated distance between the first two sequences.
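The two quantities compared above, alignment accuracy and estimated evolutionary distance, can be illustrated with a minimal sketch. The abstract does not state which accuracy measure or distance correction the study used, so the choices here are assumptions made for illustration only: accuracy is taken as the fraction of residue pairs in the true (simulated) alignment that the estimated alignment recovers, and distance is the Jukes-Cantor correction computed over ungapped columns. The function names are hypothetical.

```python
import math


def aligned_pairs(row_a, row_b):
    """Return the set of (i, j) residue-index pairs that share a column."""
    pairs = set()
    i = j = 0  # running residue indices in the ungapped sequences
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs


def alignment_accuracy(true_a, true_b, est_a, est_b):
    """Fraction of truly homologous pairs recovered by the estimate.

    Assumed measure for illustration; not necessarily the paper's.
    """
    true_pairs = aligned_pairs(true_a, true_b)
    if not true_pairs:
        raise ValueError("true alignment has no aligned residue pairs")
    est_pairs = aligned_pairs(est_a, est_b)
    return len(true_pairs & est_pairs) / len(true_pairs)


def jukes_cantor_distance(row_a, row_b):
    """Jukes-Cantor distance over columns where neither row has a gap.

    Assumed correction for illustration; not necessarily the paper's.
    """
    sites = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not sites:
        raise ValueError("no ungapped columns to compare")
    p = sum(a != b for a, b in sites) / len(sites)  # observed proportion of differences
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

In a simulation setting, the true rows come from the simulator and the estimated rows are the same two sequences extracted from either the pairwise or the three-sequence alignment. Because the distance is computed from whichever columns the aligner produced, misplaced gaps can shift it independently of the accuracy score.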
The bias in distance estimation appears to be a direct result of the standard greedy progressive algorithm used by many multiple alignment methods. These results have implications for choosing new taxa and genomes to sequence when resources are limited.