Department of Computer Science and Engineering, University of South Carolina, Columbia, South Carolina, United States of America.
PLoS One. 2013 May 2;8(5):e62156. doi: 10.1371/journal.pone.0062156. Print 2013.
Recent advancement of technologies has now made it routine to obtain and compare gene orders within genomes. Rearrangements of gene orders by operations such as reversal and transposition are rare events that enable researchers to reconstruct deep evolutionary histories. An important application of genome rearrangement analysis is to infer gene orders of ancestral genomes, which is valuable for identifying patterns of evolution and for modeling the evolutionary processes. Among various available methods, parsimony-based methods (including GRAPPA and MGR) are the most widely used. Since the core algorithms of these methods are solvers for the so called median problem, providing efficient and accurate median solver has attracted lots of attention in this field. The "double-cut-and-join" (DCJ) model uses the single DCJ operation to account for all genome rearrangement events. Because mathematically it is much simpler than handling events directly, parsimony methods using DCJ median solvers has better speed and accuracy. However, the DCJ median problem is NP-hard and although several exact algorithms are available, they all have great difficulties when given genomes are distant. In this paper, we present a new algorithm that combines genetic algorithm (GA) with genomic sorting to produce a new method which can solve the DCJ median problem in limited time and space, especially in large and distant datasets. Our experimental results show that this new GA-based method can find optimal or near optimal results for problems ranging from easy to very difficult. Compared to existing parsimony methods which may severely underestimate the true number of evolutionary events, the sorting-based approach can infer ancestral genomes which are much closer to their true ancestors. The code is available at http://phylo.cse.sc.edu.
技术的最新进展使得在基因组中获取和比较基因顺序成为常规操作。通过反转和转座等操作进行基因顺序重排是罕见的事件,使研究人员能够重建深层进化历史。基因组重排分析的一个重要应用是推断祖先基因组的基因顺序,这对于识别进化模式和模拟进化过程非常有价值。在各种可用的方法中,基于简约性的方法(包括 GRAPPA 和 MGR)是最广泛使用的方法。由于这些方法的核心算法是所谓中位数问题的求解器,因此提供高效准确的中位数求解器在该领域引起了广泛关注。“双切接”(DCJ)模型使用单个 DCJ 操作来解释所有基因组重排事件。由于从数学角度来看,它比直接处理事件要简单得多,因此使用 DCJ 中位数求解器的简约性方法具有更好的速度和准确性。然而,DCJ 中位数问题是 NP 难的,尽管有几种精确算法可用,但当处理的基因组相距较远时,它们都存在很大的困难。在本文中,我们提出了一种新的算法,该算法将遗传算法(GA)与基因组排序相结合,生成一种新的方法,可以在有限的时间和空间内解决 DCJ 中位数问题,特别是在大型和远距离数据集上。我们的实验结果表明,这种新的基于 GA 的方法可以为从简单到非常困难的问题找到最佳或接近最佳的结果。与可能严重低估真实进化事件数量的现有简约性方法相比,基于排序的方法可以推断出更接近其真实祖先的祖先基因组。该代码可在 http://phylo.cse.sc.edu 获得。