Pardi Fabio, Scornavacca Celine
Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM, UMR 5506) CNRS, Université de Montpellier, France; Institut de Biologie Computationnelle, Montpellier, France.
Institut des Sciences de l'Evolution de Montpellier (ISE-M, UMR 5554) CNRS, IRD, Université de Montpellier, France; Institut de Biologie Computationnelle, Montpellier, France.
PLoS Comput Biol. 2015 Apr 7;11(4):e1004135. doi: 10.1371/journal.pcbi.1004135. eCollection 2015 Apr.
Phylogenetic networks represent the evolution of organisms that have undergone reticulate events, such as recombination, hybrid speciation or lateral gene transfer. An important way to interpret a phylogenetic network is in terms of the trees it displays, which represent all the possible histories of the characters carried by the organisms in the network. Interestingly, however, different networks may display exactly the same set of trees, an observation that poses a problem for network reconstruction: from the perspective of many inference methods such networks are "indistinguishable". This is true for all methods that evaluate a phylogenetic network solely on the basis of how well the displayed trees fit the available data, including all methods based on input data consisting of clades, triples, quartets, or trees with any number of taxa, and also sequence-based approaches such as popular formalisations of maximum parsimony and maximum likelihood for networks. This identifiability problem is partially solved by accounting for branch lengths, although this merely reduces the frequency of the problem. Here we propose that network inference methods should only attempt to reconstruct what they can uniquely identify. To this end, we introduce a novel definition of what constitutes a uniquely reconstructible network. For any given set of indistinguishable networks, we define a canonical network that, under mild assumptions, is unique and thus representative of the entire set. Given data that underwent reticulate evolution, only the canonical form of the underlying phylogenetic network can be uniquely reconstructed. While on the methodological side this will imply a drastic reduction of the solution space in network inference, for the study of reticulate evolution this is a fundamental limitation that will require an important change of perspective when interpreting phylogenetic networks.
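The observation that different networks may display exactly the same set of trees can be checked on a toy case. The following is a minimal sketch and not the authors' method: it assumes a rooted network given as a parent-to-children dictionary, enumerates displayed trees by keeping one incoming edge per reticulation node, and encodes each tree by its set of clusters (leaf sets below each node), so that suppressed degree-two nodes need no special handling. The function name `displayed_trees` and the two three-leaf networks `N1` and `N2` are hypothetical illustrations, not examples taken from the paper.

```python
from itertools import product


def displayed_trees(children):
    """Return the set of trees displayed by a rooted network.

    `children` maps each internal node to a list of its children; leaves are
    the nodes that never appear as keys. Each displayed tree is encoded as a
    frozenset of clusters (leaf sets below its nodes), which determines a
    rooted tree topology.
    """
    # Invert the child lists to find parents, the root, and reticulations.
    parents = {}
    for u, kids in children.items():
        for v in kids:
            parents.setdefault(v, []).append(u)
    root = next(u for u in children if u not in parents)
    reticulations = [v for v, ps in parents.items() if len(ps) > 1]

    trees = set()
    # Keep exactly one incoming edge per reticulation node.
    for choice in product(*(parents[h] for h in reticulations)):
        kept = dict(zip(reticulations, choice))
        clusters = set()

        def cluster(u):
            if u not in children:  # leaf
                c = frozenset([u])
            else:
                # Follow only edges that survive the current choice.
                c = frozenset().union(
                    *(cluster(v) for v in children[u] if kept.get(v, u) == u)
                )
            if c:  # dead-end branches have empty clusters and are dropped
                clusters.add(c)
            return c

        cluster(root)
        trees.add(frozenset(clusters))
    return trees


# Hypothetical network N1: leaf b descends from a reticulation "h" whose
# parents sit on the lineages leading to a and to c.
N1 = {"r": ["p", "q"], "p": ["a", "h"], "q": ["h", "c"], "h": ["b"]}

# Hypothetical network N2: leaf a descends from a reticulation "h" whose
# parents are the root and a node on the lineage leading to b.
N2 = {"r": ["h", "v"], "v": ["w", "c"], "w": ["h", "b"], "h": ["a"]}

same = displayed_trees(N1) == displayed_trees(N2)
print(same)
```

In this sketch the two networks place the reticulation above different taxa, yet both display the same two trees, ((a,b),c) and (a,(b,c)). Any method that scores a network only through its displayed trees would therefore assign them the same score.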