Division of Information System Design, Tokyo Denki University, Hatoyama, Hiki, Saitama, Japan.
BMC Bioinformatics. 2012 Jul 2;13:155. doi: 10.1186/1471-2105-13-155.
Due to hybridization events in evolution, studying two different genes of a set of species may yield two related but different phylogenetic trees for the set of species. In this case, we want to combine the two phylogenetic trees into a hybridization network with the fewest hybridization events. This leads to three computational problems, namely, the problem of computing the minimum size of a hybridization network, the problem of constructing one minimum hybridization network, and the problem of enumerating a representative set of minimum hybridization networks. The previously best software tools for these problems (namely, Chen and Wang's HybridNet and Albrecht et al.'s Dendroscope 3) run very slowly for large instances that cannot be reduced to relatively small instances. Indeed, when the minimum size of a hybridization network of two given trees is larger than 23 and the problem for the trees cannot be reduced to relatively smaller independent subproblems, then HybridNet almost always takes longer than 1 day and Dendroscope 3 often fails to complete. Thus, a faster software tool for the problems is in need.
We develop a software tool in ANSI C, named FastHN, for the following problems: Computing the minimum size of a hybridization network, constructing one minimum hybridization network, and enumerating a representative set of minimum hybridization networks. We obtain FastHN by refining HybridNet with three ideas. The first idea is to preprocess the input trees so that the trees become smaller or the problem becomes to solve two or more relatively smaller independent subproblems. The second idea is to use a fast algorithm for computing the rSPR distance of two given phylognetic trees to cut more branches of the search tree in the exhaustive-search stage of the algorithm. The third idea is that during the exhaustive-search stage of the algorithm, we find two sibling leaves in one of the two forests (obtained from the given trees by cutting some edges) such that they are as far as possible in the other forest. As the result, FastHN always runs much faster than HybridNet. Unlike Dendroscope 3, FastHN is a single-threaded program. Despite this disadvantage, our experimental data shows that FastHN runs substantially faster than the multi-threaded Dendroscope 3 on a PC with multiple cores. Indeed, FastHN can finish within 16 minutes (on average on a Windows-7 (x64) desktop PC with i7-2600 CPU) even if the minimum size of a hybridization network of two given trees is about 25, the trees each have 100 leaves, and the problem for the input trees cannot be reduced to two or more independent subproblems via cluster reductions. It is also worth mentioning that like HybridNet, FastHN does not use much memory (indeed, the amount of memory is at most quadratic in the input size). In contrast, Dendroscope 3 uses a huge amount of memory. Executables of FastHN for Windows XP (x86), Windows 7 (x64), Linux, and Mac OS are available (see the Results and discussion section for details).
For both biological datasets and simulated datasets, our experimental results show that FastHN runs substantially faster than HybridNet and Dendroscope 3. The superiority of FastHN in speed over the previous tools becomes more significant as the hybridization number becomes larger. In addition, FastHN uses much less memory than Dendroscope 3 and uses the same amount of memory as HybridNet.
由于进化过程中的杂交事件,研究一组物种的两个不同基因可能会为该组物种产生两个相关但不同的系统发育树。在这种情况下,我们希望将这两个系统发育树组合成一个具有最少杂交事件的杂交网络。这导致了三个计算问题,即计算杂交网络最小大小的问题、构建一个最小杂交网络的问题和枚举最小杂交网络的代表集的问题。以前用于这些问题的最佳软件工具(即 Chen 和 Wang 的 HybridNet 和 Albrecht 等人的 Dendroscope 3)对于无法简化为相对较小实例的大型实例运行非常缓慢。事实上,当两个给定树的杂交网络的最小大小大于 23 且问题无法简化为相对较小的独立子问题时,HybridNet 几乎总是需要超过 1 天的时间,而 Dendroscope 3 通常无法完成。因此,需要更快的软件工具来解决这些问题。
我们使用 ANSI C 开发了一个名为 FastHN 的软件工具,用于以下问题:计算杂交网络的最小大小、构建一个最小杂交网络和枚举最小杂交网络的代表集。我们通过用三个想法来改进 HybridNet 来获得 FastHN。第一个想法是预处理输入树,以便树变得更小或问题变成解决两个或更多相对较小的独立子问题。第二个想法是使用快速算法来计算两个给定系统发育树的 rSPR 距离,以便在算法的穷举搜索阶段中切断更多的搜索树分支。第三个想法是,在算法的穷举搜索阶段,我们在两个森林之一(通过切断某些边缘从给定树中获得)中找到两个兄弟叶,使得它们在另一个森林中尽可能远。结果,FastHN 总是比 HybridNet 运行得快得多。与 Dendroscope 3 不同,FastHN 是一个单线程程序。尽管存在此缺点,但我们的实验数据表明,在具有多个内核的 PC 上,FastHN 的运行速度明显快于多线程的 Dendroscope 3。事实上,FastHN 甚至可以在 16 分钟内完成(平均在具有 i7-2600 CPU 的 Windows-7(x64)台式 PC 上),即使两个给定树的杂交网络的最小大小约为 25,每个树都有 100 个叶,并且输入树的问题无法通过聚类简化为两个或更多独立的子问题。值得一提的是,与 HybridNet 一样,FastHN 也不会使用太多内存(实际上,内存量最多是输入大小的二次方)。相比之下,Dendroscope 3 使用大量内存。FastHN 的 Windows XP(x86)、Windows 7(x64)、Linux 和 Mac OS 的可执行文件都可用(有关详细信息,请参见结果和讨论部分)。
对于生物数据集和模拟数据集,我们的实验结果表明,FastHN 的运行速度明显快于 HybridNet 和 Dendroscope 3。随着杂交数量的增加,FastHN 在速度上相对于以前的工具的优势变得更加显著。此外,FastHN 比 Dendroscope 3 使用的内存少得多,并且使用的内存与 HybridNet 相同。