Molloy Erin K, Warnow Tandy
Department of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL 61801 USA.
Algorithms Mol Biol. 2019 Jul 19;14:14. doi: 10.1186/s13015-019-0151-x. eCollection 2019.
Divide-and-conquer methods, which divide the species set into overlapping subsets, construct a tree on each subset, and then combine the subset trees using a supertree method, provide a key algorithmic framework for boosting the scalability of phylogeny estimation methods to large datasets. Yet the use of supertree methods, which typically attempt to solve NP-hard optimization problems, limits the scalability of such approaches.
In this paper, we introduce a divide-and-conquer approach that does not require supertree estimation: we divide the species set into pairwise disjoint subsets, construct a tree on each subset using a base method, and then combine the subset trees using a distance matrix. For this merger step, we present a new method, called NJMerge, which is a polynomial-time extension of Neighbor Joining (NJ); thus, NJMerge can be viewed either as a method for improving traditional NJ or as a method for scaling the base method to larger datasets. We prove that NJMerge can be used to create divide-and-conquer pipelines that are statistically consistent under some models of evolution. We also report the results of an extensive simulation study evaluating NJMerge on multi-locus datasets with up to 1000 species. We found that NJMerge sometimes improved the accuracy of traditional NJ and substantially reduced the running time of three popular species tree methods (ASTRAL-III, SVDquartets, and "concatenation" using RAxML) without sacrificing accuracy. Finally, although NJMerge can fail to return a tree, in our experiments, NJMerge failed on only 11 out of 2560 test cases.
Theoretical and empirical results suggest that NJMerge is a valuable technique for large-scale phylogeny estimation, especially when computational resources are limited. NJMerge is freely available on Github (http://github.com/ekmolloy/njmerge).
分治方法将物种集划分为重叠子集,在每个子集上构建一棵树,然后使用超树方法组合子集树,为提高系统发育估计方法对大型数据集的可扩展性提供了关键的算法框架。然而,超树方法的使用通常试图解决NP难优化问题,限制了此类方法的可扩展性。
在本文中,我们介绍了一种不需要超树估计的分治方法:我们将物种集划分为两两不相交的子集,使用基本方法在每个子集上构建一棵树,然后使用距离矩阵组合子集树。对于这个合并步骤,我们提出了一种新方法,称为NJMerge,它是邻接法(NJ)的多项式时间扩展;因此,NJMerge既可以看作是一种改进传统NJ的方法,也可以看作是一种将基本方法扩展到更大数据集的方法。我们证明NJMerge可用于创建在某些进化模型下具有统计一致性的分治管道。我们还报告了一项广泛模拟研究的结果,该研究在多达1000个物种的多基因座数据集上评估了NJMerge。我们发现NJMerge有时提高了传统NJ的准确性,并大幅缩短了三种流行物种树方法(ASTRAL-III、SVDquartets和使用RAxML的“串联”)的运行时间,且不牺牲准确性。最后,虽然NJMerge可能无法返回一棵树,但在我们的实验中,NJMerge在2560个测试用例中仅失败了11次。
理论和实证结果表明,NJMerge是大规模系统发育估计的一种有价值的技术,特别是在计算资源有限的情况下。NJMerge可在Github(http://github.com/ekmolloy/njmerge)上免费获取。