Department of Computer Science, Iowa State University, Ames 50011, USA.
BMC Bioinformatics. 2011 Feb 15;12 Suppl 1(Suppl 1):S14. doi: 10.1186/1471-2105-12-S1-S14.
The gene duplication (GD) problem seeks a species tree that implies the fewest gene duplication events across a given collection of gene trees. Solving this problem makes it possible to use large gene families with complex histories of duplication and loss to infer phylogenetic trees. However, the GD problem is NP-hard, and therefore, most analyses use heuristics that lack any performance guarantee.
We describe the first integer linear programming (ILP) formulation to solve instances of the gene duplication problem exactly. With simulations, we demonstrate that the ILP solution can solve problem instances with up to 14 taxa. Furthermore, we apply the new ILP solution to solve the gene duplication problem for the seed plant phylogeny using a 12-taxon, 6,084-gene data set. The unique, optimal solution, which places Gnetales sister to the conifers, represents a new, large-scale genomic perspective on one of the most puzzling questions in plant systematics.
Although the GD problem is NP-hard, our novel ILP solution for it can solve instances with data sets consisting of as many as 14 taxa and 1,000 genes in a few hours. These are the largest instances that have been solved to optimally to date. Thus, this work can provide large-scale genomic perspectives on phylogenetic questions that previously could only be addressed by heuristic estimates.
基因复制(GD)问题旨在寻找一个物种树,该树在给定的基因树集合中暗示了最少的基因复制事件。解决这个问题使得使用具有复杂复制和丢失历史的大型基因家族来推断系统发育树成为可能。然而,GD 问题是 NP 难的,因此,大多数分析都使用缺乏任何性能保证的启发式算法。
我们描述了第一个整数线性规划(ILP)公式来精确求解基因复制问题的实例。通过模拟,我们证明 ILP 解决方案可以解决多达 14 个分类单元的问题实例。此外,我们应用新的 ILP 解决方案来解决使用 12 个分类单元、6084 个基因数据集的种子植物系统发育的基因复制问题。唯一的、最优的解决方案将买麻藤目置于松柏目植物的姐妹群中,为植物系统学中最令人困惑的问题之一提供了一个新的、大规模的基因组视角。
尽管 GD 问题是 NP 难的,但我们的新 ILP 解决方案可以在几个小时内解决多达 14 个分类单元和 1000 个基因的数据集的实例。这些是迄今为止已优化解决的最大实例。因此,这项工作可以为以前只能通过启发式估计来解决的系统发育问题提供大规模的基因组视角。