Warnow Tandy, Tabatabaee Yasamin, Evans Steven N
Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA.
Department of Statistics, University of California at Berkeley, Berkeley, California, USA.
J Comput Biol. 2025 Jan;32(1):3-27. doi: 10.1089/cmb.2024.0710. Epub 2024 Nov 25.
We address the problem of how to estimate a phylogenetic network when given single-nucleotide polymorphisms (i.e., SNPs, or bi-allelic markers that have evolved under the infinite sites assumption). We focus on level-1 phylogenetic networks (i.e., networks where the cycles are node-disjoint), since more complex networks are unidentifiable. We provide a polynomial time quartet-based method that we prove correct for reconstructing the semi-directed level-1 phylogenetic network , if we are given a set of SNPs that covers all the bipartitions of , even if the ancestral state is not known, provided that the cycles are of length at least 5; we also prove that an algorithm developed by Dan Gusfield in the in 2005 correctly recovers semi-directed level-1 phylogenetic networks in polynomial time in this case. We present a stochastic model for DNA evolution, and we prove that the two methods (our quartet-based method and Gusfield's method) are statistically consistent estimators of the semi-directed level-1 phylogenetic network. For the case of multi-state homoplasy-free characters, we prove that our quartet-based method correctly constructs semi-directed level-1 networks under the required conditions (all cycles of length at least five), while Gusfield's algorithm cannot be used in that case. These results assume that we have access to an oracle for indicating which sites in the DNA alignment are homoplasy-free, and we show that the methods are robust, under some conditions, to oracle errors.
我们解决了在给定单核苷酸多态性(即SNP,或在无限位点假设下进化的双等位基因标记)时如何估计系统发育网络的问题。我们专注于1级系统发育网络(即循环是节点不相交的网络),因为更复杂的网络是无法识别的。我们提供了一种基于多项式时间四重奏的方法,我们证明如果给定一组覆盖所有二分划的SNP,即使祖先状态未知,只要循环长度至少为5,该方法对于重建半定向1级系统发育网络就是正确的;我们还证明了Dan Gusfield在2005年开发的一种算法在这种情况下能在多项式时间内正确恢复半定向1级系统发育网络。我们提出了一种DNA进化的随机模型,并且证明这两种方法(我们基于四重奏的方法和Gusfield的方法)是半定向1级系统发育网络的统计一致估计量。对于多状态无同塑性特征的情况,我们证明我们基于四重奏的方法在所需条件下(所有循环长度至少为五)能正确构建半定向1级网络,而在这种情况下Gusfield的算法不能使用。这些结果假设我们可以使用一个预言机来指示DNA比对中的哪些位点是无同塑性的,并且我们表明在某些条件下,这些方法对预言机错误具有鲁棒性。