School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel.
BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S42. doi: 10.1186/1471-2105-11-S1-S42.
Genomic data provide a wealth of new information for phylogenetic analysis. Yet making use of this data requires phylogenetic methods that can efficiently analyze extremely large data sets and account for processes of gene evolution, such as gene duplication and loss, incomplete lineage sorting (deep coalescence), or horizontal gene transfer, that cause incongruence among gene trees. One such approach is gene tree parsimony, which, given a set of gene trees, seeks a species tree that requires the smallest number of evolutionary events to explain the incongruence of the gene trees. However, the only existing algorithms for gene tree parsimony under the duplication-loss or deep coalescence reconciliation cost are prohibitively slow for large datasets.
We describe novel algorithms for SPR and TBR based local search heuristics under the duplication-loss cost, and we show how they can be adapted for the deep coalescence cost. These algorithms improve upon the best existing algorithms for these problems by a factor of n, where n is the number of species in the collection of gene trees. We implemented our new SPR based local search algorithm for the duplication-loss cost and demonstrate the tremendous improvement in runtime and scalability it provides compared to existing implementations. We also evaluate the performance of our algorithm on three large-scale genomic data sets.
Our new algorithms enable, for the first time, gene tree parsimony analyses of thousands of genes from hundreds of taxa using the duplication-loss and deep coalescence reconciliation costs. Thus, this work expands both the size of data sets and the range of evolutionary models that can be incorporated into genome-scale phylogenetic analyses.
基因组数据为系统发育分析提供了丰富的新信息。然而,要利用这些数据,需要使用能够有效分析极其大型数据集的系统发育方法,并能够解释基因进化过程,例如基因复制和丢失、不完全谱系分选(深合并)或水平基因转移,这些过程会导致基因树之间的不一致。一种这样的方法是基因树简约法,给定一组基因树,它会寻找一个物种树,该树需要最少的进化事件来解释基因树的不一致性。然而,对于复制-丢失或深合并重定代价下的基因树简约法,唯一现有的算法对于大型数据集来说非常缓慢。
我们描述了基于 SPR 和 TBR 的新的局部搜索启发式算法,用于复制-丢失代价,并且展示了如何将它们适用于深合并代价。与这些问题的现有最佳算法相比,这些算法的速度提高了 n 倍,其中 n 是基因树集合中的物种数量。我们实现了我们新的基于 SPR 的局部搜索算法,用于复制-丢失代价,并展示了它在运行时间和可扩展性方面提供的巨大改进,与现有实现相比。我们还在三个大型基因组数据集上评估了我们算法的性能。
我们的新算法首次能够使用复制-丢失和深合并重定代价对来自数百个分类群的数千个基因进行基因树简约法分析。因此,这项工作扩展了数据集的大小和可以纳入基因组规模系统发育分析的进化模型的范围。