Xia Xuhua
Department of Biology, University of Ottawa, 30 Marie Curie, Ottawa K1N 6N5, Canada; Ottawa Institute of Systems Biology, 451 Smyth Road, Ottawa, ON K1H 8M5, Canada.
Mol Phylogenet Evol. 2016 Sep;102:331-43. doi: 10.1016/j.ympev.2016.07.001. Epub 2016 Jul 1.
While pairwise sequence alignment (PSA) by dynamic programming is guaranteed to generate one of the optimal alignments, multiple sequence alignment (MSA) of highly divergent sequences often results in poorly aligned sequences, plaguing all subsequent phylogenetic analyses. One way to avoid this problem is to use only PSA to reconstruct phylogenetic trees, which can only be done with distance-based methods. I compared the accuracy of this new computational approach (named PhyPA, for phylogenetics by pairwise alignment) against the maximum likelihood method using MSA (the ML+MSA approach), based on nucleotide, amino acid and codon sequences simulated with different topologies and tree lengths. I present a surprising discovery: the fast PhyPA method consistently outperforms the slow ML+MSA approach for highly diverged sequences, even when all optimization options are turned on for the ML+MSA approach. Only when sequences are not highly diverged (i.e., when a reliable MSA can be obtained) does the ML+MSA approach outperform PhyPA. The true topologies are always recovered by ML with the true alignment from the simulation. However, with MSAs derived from alignment programs such as MAFFT or MUSCLE, the recovered topology consistently has a higher likelihood than the true topology. Thus, the failure of the ML+MSA approach to recover the true topology is due not to an insufficient search of tree space but to the distortion of the phylogenetic signal by MSA methods. I have implemented PhyPA in DAMBE, together with two approaches that use multi-gene data sets to derive phylogenetic support for subtrees, equivalent to resampling techniques such as bootstrapping and jackknifing.
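The core idea of building a tree from pairwise alignments alone can be illustrated with a minimal Python sketch. It is not the DAMBE implementation, and every specific choice in it is an assumption made for illustration: the Needleman-Wunsch scoring values (`match`, `mismatch`, `gap`), the Jukes-Cantor (JC69) distance, the UPGMA clustering step, and all function names. The abstract states only that PSA is combined with distance-based methods.

```python
import math
from itertools import combinations


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns one optimal alignment.
    Scoring values are illustrative assumptions, not DAMBE's defaults."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover the alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def jc69_distance(aln_a, aln_b):
    """JC69 distance over columns in which neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(aln_a, aln_b) if x != "-" and y != "-"]
    if not pairs:
        return float("inf")
    p = sum(x != y for x, y in pairs) / len(pairs)
    if p >= 0.75:
        return float("inf")  # saturated: the JC69 correction is undefined
    return -0.75 * math.log(1 - 4 * p / 3)


def pairwise_distances(seqs):
    """One distance per pair, each from its own pairwise alignment (no MSA)."""
    return {frozenset((x, y)): jc69_distance(*needleman_wunsch(seqs[x], seqs[y]))
            for x, y in combinations(sorted(seqs), 2)}


def upgma(names, dist):
    """Average-linkage clustering; returns the topology as nested tuples."""
    clusters = {name: (name, 1) for name in names}  # key -> (subtree, size)
    d = dict(dist)
    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        (tx, nx), (ty, ny) = clusters.pop(x), clusters.pop(y)
        new = f"({x},{y})"
        for k in clusters:
            d[frozenset((new, k))] = (nx * d[frozenset((x, k))]
                                      + ny * d[frozenset((y, k))]) / (nx + ny)
        clusters[new] = ((tx, ty), nx + ny)
    return next(iter(clusters.values()))[0]


# Toy sequences, invented for this example.
toy = {"A": "ACGTACGTAC", "B": "ACGTACGTAA",
       "C": "TTGTCCGAAC", "D": "TTGTCCGAAT"}
print(upgma(sorted(toy), pairwise_distances(toy)))
```

Because each distance comes from its own pairwise alignment, no pair is affected by alignment errors introduced elsewhere in a multiple alignment. UPGMA is used here only because it is short; another distance-based tree builder such as neighbor joining could consume the same distance matrix.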