Lewis P O
Department of Biology, University of New Mexico, Albuquerque 87131-1091, USA.
Mol Biol Evol. 1998 Mar;15(3):277-83. doi: 10.1093/oxfordjournals.molbev.a025924.
Phylogeny reconstruction is a difficult computational problem, because the number of possible solutions increases with the number of included taxa. For example, for only 14 taxa, there are more than seven trillion possible unrooted phylogenetic trees. For this reason, phylogenetic inference methods commonly use clustering algorithms (e.g., the neighbor-joining method) or heuristic search strategies to minimize the amount of time spent evaluating nonoptimal trees. Even heuristic searches can be painfully slow, especially when computationally intensive optimality criteria such as maximum likelihood are used. I describe here a different approach to heuristic searching (using a genetic algorithm) that can tremendously reduce the time required for maximum-likelihood phylogenetic inference, especially for data sets involving large numbers of taxa. Genetic algorithms are simulations of natural selection in which individuals are encoded solutions to the problem of interest. Here, labeled phylogenetic trees are the individuals, and differential reproduction is effected by allowing the number of offspring produced by each individual to be proportional to that individual's rank likelihood score. Natural selection increases the average likelihood in the evolving population of phylogenetic trees, and the genetic algorithm is allowed to proceed until the likelihood of the best individual ceases to improve over time. An example is presented involving rbcL sequence data for 55 taxa of green plants. The genetic algorithm described here required only 6% of the computational effort required by a conventional heuristic search using tree bisection/reconnection (TBR) branch swapping to obtain the same maximum-likelihood topology.
系统发育重建是一个困难的计算问题,因为可能的解决方案数量会随着所包含分类单元的数量增加而增加。例如,仅对于14个分类单元,就有超过七万亿种可能的无根系统发育树。因此,系统发育推断方法通常使用聚类算法(例如邻接法)或启发式搜索策略,以尽量减少评估非最优树所花费的时间。即使是启发式搜索也可能极其缓慢,特别是当使用计算密集型的最优性标准(如最大似然法)时。我在此描述一种不同的启发式搜索方法(使用遗传算法),它可以极大地减少最大似然系统发育推断所需的时间,特别是对于涉及大量分类单元的数据集。遗传算法是对自然选择的模拟,其中个体是对感兴趣问题的编码解决方案。在这里,带标签的系统发育树就是个体,而差异繁殖是通过允许每个个体产生的后代数量与其个体的排名似然得分成比例来实现的。自然选择会提高系统发育树进化种群中的平均似然性,并且遗传算法会一直运行,直到最佳个体的似然性不再随时间提高为止。给出了一个涉及55种绿色植物的rbcL序列数据的例子。这里描述的遗传算法仅需要使用树二分/重连(TBR)分支交换的传统启发式搜索来获得相同最大似然拓扑所需计算量的6%。