Department of Computer Science, Calvin College, Grand Rapids, MI 49546, USA.
Bioinformatics. 2012 Jun 15;28(12):i274-82. doi: 10.1093/bioinformatics/bts218.
While phylogenetic analyses of datasets containing 1000-5000 sequences are challenging for existing methods, the estimation of substantially larger phylogenies poses a problem of much greater complexity and scale.
We present DACTAL, a method for phylogeny estimation that produces trees from unaligned sequence datasets without ever needing to estimate an alignment on the entire dataset. DACTAL combines iteration with a novel divide-and-conquer approach: each iteration begins with the tree produced in the prior iteration, decomposes the taxon set into overlapping subsets, estimates a tree on each subset, and then combines the smaller trees into a tree on the full taxon set using a new supertree method. We prove that DACTAL is guaranteed to produce the true tree under certain conditions. We compare DACTAL to SATé and to maximum likelihood trees computed on estimated alignments, using simulated and real datasets with 1000 to 27,643 taxa.
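The iteration described above can be sketched as a short loop. This is only an illustrative skeleton, not the DACTAL implementation: the sliding-window decomposition over a leaf ordering stands in for DACTAL's tree-based decomposition, and the names `overlapping_subsets`, `iterate_divide_and_conquer`, `leaf_order_of`, `estimate_subtree` and `merge_subtrees` are hypothetical placeholders for steps (subset tree estimation, supertree merging) whose details the abstract does not give.

```python
from typing import Any, Callable, List, Sequence


def overlapping_subsets(leaf_order: Sequence[str],
                        max_size: int,
                        overlap: int) -> List[List[str]]:
    """Split an ordered taxon list into windows of at most `max_size`
    taxa, where consecutive windows share `overlap` taxa.

    Assumption: a simple sliding window replaces the real decomposition.
    """
    if not 0 <= overlap < max_size:
        raise ValueError("need 0 <= overlap < max_size")
    step = max_size - overlap
    subsets: List[List[str]] = []
    start = 0
    while True:
        subsets.append(list(leaf_order[start:start + max_size]))
        if start + max_size >= len(leaf_order):
            break  # the last window reached the end of the ordering
        start += step
    return subsets


def iterate_divide_and_conquer(tree: Any,
                               leaf_order_of: Callable[[Any], Sequence[str]],
                               estimate_subtree: Callable[[List[str]], Any],
                               merge_subtrees: Callable[[List[Any]], Any],
                               iterations: int,
                               max_size: int,
                               overlap: int) -> Any:
    """Repeat: decompose the current tree's taxa into overlapping
    subsets, estimate a tree per subset, merge into a full tree."""
    for _ in range(iterations):
        subsets = overlapping_subsets(leaf_order_of(tree), max_size, overlap)
        subtrees = [estimate_subtree(s) for s in subsets]
        tree = merge_subtrees(subtrees)  # hypothetical supertree step
    return tree
```

In this sketch the caller supplies the per-subset estimator and the merger, so any alignment and tree-estimation pipeline could be plugged in for the small subsets without an alignment of the full dataset ever being computed.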
Our studies show that, on average, DACTAL yields more accurate trees than the two-phase methods we studied on very large datasets that are difficult to align, and has approximately the same accuracy on the easier datasets. The comparison to SATé shows that both methods have the same accuracy, but that DACTAL achieves this accuracy in a fraction of the time. Furthermore, DACTAL can analyze larger datasets than SATé, including a dataset with almost 28,000 sequences.
DACTAL source code and results of dataset analyses are available at www.cs.utexas.edu/users/phylo/software/dactal.