A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood.

Author information

Guindon Stéphane, Gascuel Olivier

Affiliation

LIRMM, CNRS, 161 Rue Ada, 34392, Montpellier Cedex 5, France.

Publication information

Syst Biol. 2003 Oct;52(5):696-704. doi: 10.1080/10635150390235520.

DOI:10.1080/10635150390235520

PMID:14530136

Abstract

The increase in the number of large data sets and the complexity of current probabilistic sequence evolution models necessitates fast and reliable phylogeny reconstruction methods. We describe a new approach, based on the maximum-likelihood principle, which clearly satisfies these requirements. The core of this method is a simple hill-climbing algorithm that adjusts tree topology and branch lengths simultaneously. This algorithm starts from an initial tree built by a fast distance-based method and modifies this tree to improve its likelihood at each iteration. Due to this simultaneous adjustment of the topology and branch lengths, only a few iterations are sufficient to reach an optimum. We used extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximum-likelihood programs and much higher than the performance of distance-based and parsimony approaches. The reduction of computing time is dramatic in comparison with other maximum-likelihood packages, while the likelihood maximization ability tends to be higher. For example, only 12 min were required on a standard personal computer to analyze a data set consisting of 500 rbcL sequences with 1,428 base pairs from plant plastids, thus reaching a speed of the same order as some popular distance-based and parsimony algorithms. This new method is implemented in the PHYML program, which is freely available on our web page: http://www.lirmm.fr/w3ifa/MAAS/.
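The hill-climbing idea mentioned in the abstract can be illustrated with a deliberately reduced sketch. The example below is not the PHYML procedure: it optimizes a single branch length between two aligned sequences and performs no topology moves. The Jukes-Cantor (JC69) substitution model, the function names `jc69_log_likelihood` and `hill_climb_branch_length`, the starting value, the step-halving rule, and the site counts are all assumptions chosen for illustration. It also assumes that fewer than 75% of the sites differ, so that a finite optimum exists.

```python
import math


def jc69_log_likelihood(t, n_sites, n_diff):
    """Log-likelihood of two aligned sequences separated by branch length t
    (expected substitutions per site) under the JC69 model (an assumption)."""
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e  # probability that a site keeps the same base
    p_diff = 0.25 - 0.25 * e  # probability of one specific different base
    n_same = n_sites - n_diff
    # The factor 0.25 is the equilibrium frequency of the base in the first sequence.
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)


def hill_climb_branch_length(n_sites, n_diff, t=0.1, step=0.05, tol=1e-8):
    """Accept a move of +/- step whenever it raises the likelihood;
    halve the step when neither direction improves."""
    best = jc69_log_likelihood(t, n_sites, n_diff)
    while step > tol:
        improved = False
        for candidate in (t + step, t - step):
            if candidate <= 0:
                continue  # branch lengths must stay positive
            score = jc69_log_likelihood(candidate, n_sites, n_diff)
            if score > best:
                t, best, improved = candidate, score, True
                break
        if not improved:
            step /= 2.0
    return t, best


if __name__ == "__main__":
    # Hypothetical data: 1,428 aligned sites, 100 of which differ.
    t_hat, log_lik = hill_climb_branch_length(1428, 100)
    # Closed-form JC69 estimate, used here only as a cross-check.
    t_closed = -0.75 * math.log(1.0 - 4.0 * (100 / 1428) / 3.0)
    print(t_hat, t_closed, log_lik)
```

For this two-sequence case the JC69 estimate has a closed form, so the search is unnecessary in practice; it is shown only because the climbed value can be checked against that formula. On a full tree no such closed form exists, which is where an iterative improvement scheme of the kind the abstract describes becomes relevant.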
