Laboratory for Computational Biology and Bioinformatics, EPFL, CH-1015 Lausanne, Switzerland.
BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S30. doi: 10.1186/1471-2105-11-S1-S30.
The study of genome rearrangements has become a mainstay of phylogenetics and comparative genomics. Fundamental in such a study is the median problem: given three genomes find a fourth that minimizes the sum of the evolutionary distances between itself and the given three. Many exact algorithms and heuristics have been developed for the inversion median problem, of which the best known is MGR.
We present a unifying framework for median heuristics, which enables us to clarify existing strategies and to place them in a partial ordering. Analysis of this framework leads to a new insight: the best strategies continue to refer to the input data rather than reducing the problem to smaller instances. Using this insight, we develop a new heuristic for inversion medians that uses input data to the end of its computation and leverages our previous work with DCJ medians. Finally, we present the results of extensive experimentation showing that our new heuristic outperforms all others in accuracy and, especially, in running time: the heuristic typically returns solutions within 1% of optimal and runs in seconds to minutes even on genomes with 25'000 genes--in contrast, MGR can take days on instances of 200 genes and cannot be used beyond 1'000 genes.
Finding good rearrangement medians, in particular inversion medians, had long been regarded as the computational bottleneck in whole-genome studies. Our new heuristic for inversion medians, ASM, which dominates all others in our framework, puts that issue to rest by providing near-optimal solutions within seconds to minutes on even the largest genomes.
基因组重排的研究已成为系统发育学和比较基因组学的主要内容。在这样的研究中,中位数问题是基础:给定三个基因组,找到第四个基因组,使其自身与给定三个基因组之间的进化距离总和最小。已经开发出许多用于反转中位数问题的精确算法和启发式算法,其中最著名的是 MGR。
我们提出了一种中位数启发式算法的统一框架,使我们能够澄清现有的策略,并将它们置于部分排序中。对该框架的分析导致了一个新的见解:最佳策略继续引用输入数据,而不是将问题简化为较小的实例。利用这一见解,我们开发了一种新的反转中位数启发式算法,该算法将输入数据一直用于其计算,并利用我们以前在 DCJ 中位数方面的工作。最后,我们展示了广泛的实验结果,表明我们的新启发式算法在准确性方面优于所有其他算法,尤其是在运行时间方面:该启发式算法通常可以在 1%的最优范围内返回解决方案,并且即使在具有 25000 个基因的基因组上也可以在几秒钟到几分钟内运行,相比之下,MGR 在 200 个基因的实例上可能需要数天的时间,并且不能用于超过 1000 个基因的实例。
寻找良好的重排中位数,特别是反转中位数,长期以来一直被认为是全基因组研究中的计算瓶颈。我们的新反转中位数启发式算法 ASM 在我们的框架中优于所有其他算法,通过在几秒钟到几分钟内提供接近最优的解决方案,解决了这个问题,即使在最大的基因组上也是如此。