New approaches to phylogenetic tree search and their application to large numbers of protein alignments.

Author information

Whelan Simon

Affiliation

Faculty of Life Sciences, Michael Smith Building, University of Manchester, Oxford Road, Manchester, UK.

Publication information

Syst Biol. 2007 Oct;56(5):727-40. doi: 10.1080/10635150701611134.

PMID:17849327

Abstract

Phylogenetic tree estimation plays a critical role in a wide variety of molecular studies, including molecular systematics, phylogenetics, and comparative genomics. Finding the optimal tree relating a set of sequences using score-based (optimality criterion) methods, such as maximum likelihood and maximum parsimony, may require all possible trees to be considered, which is not feasible even for modest numbers of sequences. In practice, trees are estimated using heuristics that represent a trade-off between topological accuracy and speed. I present a series of novel algorithms suitable for score-based phylogenetic tree reconstruction that demonstrably improve the accuracy of tree estimates while maintaining high computational speeds. The heuristics function by allowing the efficient exploration of large numbers of trees through novel hill-climbing and resampling strategies. These heuristics, and other computational approximations, are implemented for maximum likelihood estimation of trees in the program Leaphy, and its performance is compared to other popular phylogenetic programs. Trees are estimated from 4059 different protein alignments using a selection of phylogenetic programs and the likelihoods of the tree estimates are compared. Trees estimated using Leaphy are found to have equal to or better likelihoods than trees estimated using other phylogenetic programs in 4004 (98.6%) families and provide a unique best tree that no other program found in 1102 (27.1%) families. The improvement is particularly marked for larger families (80 to 100 sequences), where Leaphy finds a unique best tree in 81.7% of families.
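To illustrate why considering all possible trees is not feasible, the sketch below counts the distinct unrooted, fully bifurcating topologies on n labeled sequences using the standard double-factorial formula (2n-5)!!. This is only an illustration of the size of the search space; it is not part of Leaphy, and the function name `num_unrooted_binary_trees` is a hypothetical one chosen for this example.

```python
def num_unrooted_binary_trees(n_taxa: int) -> int:
    """Count unrooted, fully bifurcating topologies on n_taxa labeled leaves.

    Uses the standard result (2n - 5)!! = 3 * 5 * ... * (2n - 5) for n >= 3.
    Hypothetical helper for illustration; not taken from any program.
    """
    if n_taxa < 3:
        return 1  # only one way to connect one or two leaves
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    # Family sizes up to 100 sequences, the largest size class in the abstract.
    for n in (4, 5, 10, 20, 50, 100):
        trees = num_unrooted_binary_trees(n)
        print(f"{n:>3} sequences: about {trees:.3e} unrooted topologies")
```

Ten sequences already give 2,027,025 topologies, and the count grows faster than exponentially from there, which is why score-based methods rely on heuristic search rather than exhaustive enumeration.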
