Characterizing the phylogenetic tree-search problem.

Affiliation

Faculty of Life Sciences, University of Manchester, Michael Smith Building, Oxford Road, Manchester M13 9PT, UK.

Publication information

Syst Biol. 2012 Mar;61(2):228-39. doi: 10.1093/sysbio/syr097. Epub 2011 Nov 10.

PMID:22076302

Abstract

Phylogenetic trees are important in many areas of biological research, ranging from systematic studies to the methods used for genome annotation. Finding the best scoring tree under any optimality criterion is an NP-hard problem, which necessitates the use of heuristics for tree-search. Although tree-search plays a major role in obtaining a tree estimate, its characteristics, and how the elements of the statistical inferential procedure interact with the algorithms used, remain poorly understood. This study begins to answer some of these questions through a detailed examination of maximum likelihood tree-search on a wide range of real genome-scale data sets. We examine all 10,395 trees for each of the 106 genes of an eight-taxon yeast phylogenomic data set, then apply different tree-search algorithms to investigate their performance. We extend our findings by examining two larger genome-scale data sets and a large disparate data set that has been previously used to benchmark the performance of tree-search programs. We identify several broad trends occurring during tree-search that provide insight into the performance of heuristics and may, in the future, aid their development. These trends include a tendency for the true maximum likelihood (best) tree to also be the shortest tree in terms of branch lengths, a weak tendency for tree-search to recover the best tree, and a tendency for tree-search to encounter fewer local optima in genes that have a high information content. When examining current heuristics for tree-search, we find that nearest-neighbor-interchange performs poorly and frequently finds trees that are significantly different from the best tree. In contrast, subtree-pruning-and-regrafting tends to perform well, nearly always finding trees that are not significantly different from the best tree. Finally, we demonstrate that the precise implementation of a tree-search strategy, including when and where parameters are optimized, can change the character of tree-search, and that good strategies for tree-search may combine existing tree-search programs.
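
The figure of 10,395 trees for eight taxa matches the standard count of unrooted, fully resolved (binary) tree topologies, (2n - 5)!! for n taxa. The sketch below is illustrative and not taken from the paper: it assumes the study enumerated unrooted binary topologies, and the function names are hypothetical. It also computes the textbook neighborhood sizes for nearest-neighbor-interchange, 2(n - 3), and subtree-pruning-and-regrafting, 2(n - 3)(2n - 7), on unrooted binary trees, which gives a rough sense of how much more of tree space a single SPR step can reach.

```python
def count_unrooted_binary_trees(n_taxa: int) -> int:
    """Number of unrooted binary topologies on n labelled taxa: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


def nni_neighborhood_size(n_taxa: int) -> int:
    """Trees one NNI move away: two rearrangements per internal edge."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return 2 * (n_taxa - 3)


def spr_neighborhood_size(n_taxa: int) -> int:
    """Trees one SPR move away, using the standard formula 2(n - 3)(2n - 7)."""
    if n_taxa < 4:
        raise ValueError("need at least 4 taxa")
    return 2 * (n_taxa - 3) * (2 * n_taxa - 7)


if __name__ == "__main__":
    n = 8  # the eight-taxon yeast data set described in the abstract
    print("topologies:", count_unrooted_binary_trees(n))
    print("NNI neighbors:", nni_neighborhood_size(n))
    print("SPR neighbors:", spr_neighborhood_size(n))
```

Under these assumptions, exhaustive evaluation is feasible for eight taxa, but the double factorial grows so quickly that it soon becomes impractical for larger data sets, which is consistent with the abstract's reliance on heuristics there.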
