迈向改进对最优系统发育树的搜索。

Ford Eric, St John Katherine, Wheeler Ward C

Department of Computer Science, Graduate Center, CUNY, New York, NY, 10016, USA; Department of Mathematics and Computer Science, Lehman College, CUNY, Bronx, NY, 10468, USA; and Division of Invertebrate Zoology, American Museum of Natural History, New York, NY, 10024, USA; Department of Computer Science, Graduate Center, CUNY, New York, NY, 10016, USA; Department of Mathematics and Computer Science, Lehman College, CUNY, Bronx, NY, 10468, USA; and Division of Invertebrate Zoology, American Museum of Natural History, New York, NY, 10024, USA;

Syst Biol. 2015 Jan;64(1):56-65. doi: 10.1093/sysbio/syu065. Epub 2014 Aug 26.

Finding the optimal evolutionary history for a set of taxa is a challenging computational problem, even when restricting possible solutions to be "tree-like" and focusing on the maximum-parsimony optimality criterion. This has led to much work on using heuristic tree searches to find approximate solutions. We present an approach for finding exact optimal solutions that employs and complements the current heuristic methods for finding optimal trees. Given a set of taxa and a set of aligned sequences of characters, there may be subsets of characters that are compatible, and for each such subset there is an associated (possibly partially resolved) phylogeny with edges corresponding to each character state change. These perfect phylogenies serve as anchor trees for our constrained search space. We show that, for sequences with compatible sites, the parsimony score of any tree [Formula: see text] is at least the parsimony score of the anchor trees plus the number of inferred changes between [Formula: see text] and the anchor trees. As the maximum-parsimony optimality score is additive, the sum of the lower bounds on compatible character partitions provides a lower bound on the complete alignment of characters. This yields a region in the space of trees within which the best tree is guaranteed to be found; limiting the search for the optimal tree to this region can significantly reduce the number of trees that must be examined in a search of the space of trees. We analyze this method empirically using four different biological data sets as well as surveying 400 data sets from the TreeBASE repository, demonstrating the effectiveness of our technique in reducing the number of steps in exact heuristic searches for trees under the maximum-parsimony optimality criterion.

为一组分类单元找到最优进化历史是一个具有挑战性的计算问题，即使将可能的解决方案限制为“树状”并专注于最大简约性最优标准。这导致了许多关于使用启发式树搜索来寻找近似解决方案的工作。我们提出了一种寻找精确最优解的方法，该方法采用并补充了当前用于寻找最优树的启发式方法。给定一组分类单元和一组对齐的字符序列，可能存在兼容的字符子集，并且对于每个这样的子集，都有一个相关的（可能部分解析的）系统发育树，其边对应于每个字符状态变化。这些完美系统发育树作为我们受限搜索空间的锚定树。我们表明，对于具有兼容位点的序列，任何树[公式：见原文]的简约得分至少是锚定树的简约得分加上[公式：见原文]与锚定树之间推断变化的数量。由于最大简约性最优得分是可加的，兼容字符分区的下界之和为字符的完整比对提供了一个下界。这在树的空间中产生了一个区域，在该区域内保证能找到最佳树；将最优树的搜索限制在这个区域可以显著减少在树空间搜索中必须检查的树的数量。我们使用四个不同的生物学数据集对该方法进行了实证分析，并对来自TreeBASE存储库的400个数据集进行了调查，证明了我们的技术在减少基于最大简约性最优标准的精确启发式树搜索步骤数量方面的有效性。