Wu Gang, Kao Ming-Yang, Lin Guohui, You Jia-Huai
Department of Computing Science, University of Alberta, Edmonton, Alberta T6G 2E8, Canada.
Algorithms Mol Biol. 2008 Jan 24;3:1. doi: 10.1186/1748-7188-3-1.
In recent years, quartet-based phylogeny reconstruction methods have received considerable attentions in the computational biology community. Traditionally, the accuracy of a phylogeny reconstruction method is measured by simulations on synthetic datasets with known "true" phylogenies, while little theoretical analysis has been done. In this paper, we present a new model-based approach to measuring the accuracy of a quartet-based phylogeny reconstruction method. Under this model, we propose three efficient algorithms to reconstruct the "true" phylogeny with a high success probability.
The first algorithm can reconstruct the "true" phylogeny from the input quartet topology set without quartet errors in O(n2) time by querying at most (n - 4) log(n - 1) quartet topologies, where n is the number of the taxa. When the input quartet topology set contains errors, the second algorithm can reconstruct the "true" phylogeny with a probability approximately 1 - p in O(n4 log n) time, where p is the probability for a quartet topology being an error. This probability is improved by the third algorithm to approximately [equation; see text], where [equation, see text], with running time of O(n5), which is at least 0.984 when p < 0.05.
The three proposed algorithms are mathematically guaranteed to reconstruct the "true" phylogeny with a high success probability. The experimental results showed that the third algorithm produced phylogenies with a higher probability than its aforementioned theoretical lower bound and outperformed some existing phylogeny reconstruction methods in both speed and accuracy.
近年来,基于四重奏的系统发育重建方法在计算生物学界受到了广泛关注。传统上,系统发育重建方法的准确性是通过在具有已知“真实”系统发育的合成数据集上进行模拟来衡量的,而理论分析较少。在本文中,我们提出了一种基于模型的新方法来衡量基于四重奏的系统发育重建方法的准确性。在这个模型下,我们提出了三种高效算法,以高成功率重建“真实”系统发育。
第一种算法可以通过最多查询(n - 4) log(n - 1)个四重奏拓扑结构,在O(n2)时间内从无四重奏错误的输入四重奏拓扑结构集中重建“真实”系统发育,其中n是分类单元的数量。当输入四重奏拓扑结构集包含错误时,第二种算法可以在O(n4 log n)时间内以大约1 - p的概率重建“真实”系统发育,其中p是四重奏拓扑结构出现错误的概率。第三种算法将这个概率提高到大约[公式;见原文],其中[公式,见原文],运行时间为O(n5) , 当p < 0.05时至少为0.984。
所提出的三种算法在数学上保证能以高成功率重建“真实”系统发育。实验结果表明,第三种算法生成系统发育的概率高于上述理论下限,并且在速度和准确性方面均优于一些现有的系统发育重建方法。