Department of Computer Science, ETH Zurich, Universitaetstr. 6, 8092 Zürich, Switzerland.
Genome Biol. 2010;11(4):R37. doi: 10.1186/gb-2010-11-4-r37. Epub 2010 Apr 6.
The alignment of biological sequences is of chief importance to most evolutionary and comparative genomics studies, yet the two main approaches used to assess alignment accuracy have flaws: reference alignments are derived from the biased sample of proteins with known structure, and simulated data lack realism.
Here, we introduce tree-based tests of alignment accuracy, which not only use large and representative samples of real biological data, but also enable the evaluation of the effect of gap placement on phylogenetic inference. We show that (i) the current belief that consistency-based alignments outperform scoring matrix-based alignments is misguided; (ii) gaps carry substantial phylogenetic signal, but are poorly exploited by most alignment and tree building programs; (iii) even so, excluding gaps and variable regions is detrimental; (iv) disagreement among alignment programs says little about the accuracy of resulting trees.
This study provides the broad community relying on sequence alignment with important practical recommendations, sets superior standards for assessing alignment accuracy, and paves the way for the development of phylogenetic inference methods of significantly higher resolution.