Friedman Nir, Ninio Matan, Pe'er Itsik, Pupko Tal
School of Computer Science and Engineering, Hebrew University, Jerusalem, 91904, Israel.
J Comput Biol. 2002;9(2):331-53. doi: 10.1089/10665270252935494.
A central task in the study of molecular evolution is the reconstruction of a phylogenetic tree from sequences of current-day taxa. The most established approach to tree reconstruction is maximum likelihood (ML) analysis. Unfortunately, searching for the maximum likelihood phylogenetic tree is computationally prohibitive for large data sets. In this paper, we describe a new algorithm that uses Structural Expectation Maximization (EM) for learning maximum likelihood phylogenetic trees. This algorithm is similar to the standard EM method for edge-length estimation, except that during iterations of the Structural EM algorithm the topology is improved as well as the edge length. Our algorithm performs iterations of two steps. In the E-step, we use the current tree topology and edge lengths to compute expected sufficient statistics, which summarize the data. In the M-Step, we search for a topology that maximizes the likelihood with respect to these expected sufficient statistics. We show that searching for better topologies inside the M-step can be done efficiently, as opposed to standard methods for topology search. We prove that each iteration of this procedure increases the likelihood of the topology, and thus the procedure must converge. This convergence point, however, can be a suboptimal one. To escape from such "local optima," we further enhance our basic EM procedure by incorporating moves in the flavor of simulated annealing. We evaluate these new algorithms on both synthetic and real sequence data and show that for protein sequences even our basic algorithm finds more plausible trees than existing methods for searching maximum likelihood phylogenies. Furthermore, our algorithms are dramatically faster than such methods, enabling, for the first time, phylogenetic analysis of large protein data sets in the maximum likelihood framework.
分子进化研究中的一个核心任务是根据当今生物分类群的序列重建系统发育树。最成熟的树重建方法是最大似然(ML)分析。不幸的是,对于大数据集而言,搜索最大似然系统发育树在计算上是难以实现的。在本文中,我们描述了一种新算法,该算法使用结构期望最大化(EM)来学习最大似然系统发育树。此算法与用于边长度估计的标准EM方法类似,不同之处在于在结构EM算法的迭代过程中,拓扑结构和边长度都会得到改进。我们的算法执行两个步骤的迭代。在E步骤中,我们使用当前的树拓扑结构和边长度来计算期望充分统计量,这些统计量总结了数据。在M步骤中,我们搜索一个相对于这些期望充分统计量使似然性最大化的拓扑结构。我们表明,与标准的拓扑搜索方法不同,在M步骤中搜索更好的拓扑结构可以高效完成。我们证明该过程的每次迭代都会增加拓扑结构的似然性,因此该过程必定收敛。然而,这个收敛点可能是次优的。为了摆脱这种“局部最优”,我们通过纳入模拟退火风格的移动来进一步增强我们的基本EM过程。我们在合成序列数据和真实序列数据上评估了这些新算法,结果表明对于蛋白质序列,即使是我们的基本算法也能找到比现有最大似然系统发育搜索方法更合理的树。此外,我们的算法比这些方法快得多,首次能够在最大似然框架下对大型蛋白质数据集进行系统发育分析。