Huson Daniel H, Rupp Regula, Berry Vincent, Gambette Philippe, Paul Christophe
Center for Bioinformatics ZBIT, Tübingen University, Tübingen, Germany.
Bioinformatics. 2009 Jun 15;25(12):i85-93. doi: 10.1093/bioinformatics/btp217.
Developing methods for computing phylogenetic networks from biological data is an important problem posed by molecular evolution and much work is currently being undertaken in this area. Although promising approaches exist, there are no tools available that biologists could easily and routinely use to compute rooted phylogenetic networks on real datasets containing tens or hundreds of taxa. Biologists are interested in clades, i.e. groups of monophyletic taxa, and these are usually represented by clusters in a rooted phylogenetic tree. The problem of computing an optimal rooted phylogenetic network from a set of clusters, is hard, in general. Indeed, even the problem of just determining whether a given network contains a given cluster is hard. Hence, some researchers have focused on topologically restricted classes of networks, such as galled trees and level-k networks, that are more tractable, but have the practical draw-back that a given set of clusters will usually not possess such a representation.
In this article, we argue that galled networks (a generalization of galled trees) provide a good trade-off between level of generality and tractability. Any set of clusters can be represented by some galled network and the question whether a cluster is contained in such a network is easy to solve. Although the computation of an optimal galled network involves successively solving instances of two different NP-complete problems, in practice our algorithm solves this problem exactly on large datasets containing hundreds of taxa and many reticulations in seconds, as illustrated by a dataset containing 279 prokaryotes.
We provide a fast, robust and easy-to-use implementation of this work in version 2.0 of our tree-handling software Dendroscope, freely available from http://www.dendroscope.org.
开发从生物数据计算系统发育网络的方法是分子进化提出的一个重要问题,目前该领域正在进行大量工作。尽管存在有前景的方法,但没有可供生物学家轻松且常规使用的工具来在包含数十或数百个分类单元的真实数据集上计算有根系统发育网络。生物学家对进化枝感兴趣,即单系分类单元的群体,这些通常由有根系统发育树中的聚类表示。一般来说,从一组聚类计算最优有根系统发育网络的问题很难。实际上,即使只是确定给定网络是否包含给定聚类的问题也很难。因此,一些研究人员专注于拓扑受限的网络类别,例如带结树和k级网络,它们更易于处理,但实际缺点是给定的一组聚类通常不会具有这样的表示。
在本文中,我们认为带结网络(带结树的推广)在通用性水平和可处理性之间提供了良好的权衡。任何一组聚类都可以由某个带结网络表示,并且确定一个聚类是否包含在这样的网络中的问题很容易解决。尽管计算最优带结网络涉及依次解决两个不同的NP完全问题的实例,但在实践中,我们的算法能在包含数百个分类单元和许多网状结构的大型数据集上在几秒钟内准确解决这个问题,如一个包含279个原核生物的数据集所示。
我们在树处理软件Dendroscope的2.0版本中提供了这项工作的快速、稳健且易于使用的实现,可从http://www.dendroscope.org免费获取。