Weizmann Institute, Rehovot, Israel.
Bull Math Biol. 2011 Jul;73(7):1627-44. doi: 10.1007/s11538-010-9584-6. Epub 2010 Oct 8.
The accurate reconstruction of phylogenies from short molecular sequences is an important problem in computational biology. Recent work has highlighted deep connections between sequence-length requirements for high-probability phylogeny reconstruction and the related problem of the estimation of ancestral sequences. In Daskalakis et al. (in Probab. Theory Relat. Fields 2010), building on the work of Mossel (Trans. Am. Math. Soc. 356(6):2379-2404, 2004), a tight sequence-length requirement was obtained for the simple CFN model of substitution, that is, the case of a two-state symmetric rate matrix Q. In particular the required sequence length for high-probability reconstruction was shown to undergo a sharp transition (from O(log n) to poly(n), where n is the number of leaves) at the "critical" branch length g (ML)(Q) (if it exists) of the ancestral reconstruction problem defined roughly as follows: below g (ML)(Q) the sequence at the root can be accurately estimated from sequences at the leaves on deep trees, whereas above g (ML)(Q) information decays exponentially quickly down the tree.Here, we consider a more general evolutionary model, the GTR model, where the q×q rate matrix Q is reversible with q≥2. For this model, recent results of Roch (Preprint, 2009) show that the tree can be accurately reconstructed with sequences of length O(log (n)) when the branch lengths are below g (Lin)(Q), known as the Kesten-Stigum (KS) bound, up to which ancestral sequences can be accurately estimated using simple linear estimators. Although for the CFN model g (ML)(Q)=g (Lin)(Q) (in other words, linear ancestral estimators are in some sense best possible), it is known that for the more general GTR models one has g (ML)(Q)≥g (Lin)(Q) with a strict inequality in many cases. Here, we show that this phenomenon also holds for phylogenetic reconstruction by exhibiting a family of symmetric models Q and a phylogenetic reconstruction algorithm which recovers the tree from O(log n)-length sequences for some branch lengths in the range (g (Lin)(Q),g (ML)(Q)). Second, we prove that phylogenetic reconstruction under GTR models requires a polynomial sequence-length for branch lengths above g (ML)(Q).
从短分子序列准确重建系统发育是计算生物学中的一个重要问题。最近的工作强调了序列长度要求对于高概率系统发育重建和相关的祖先序列估计问题之间的深刻联系。在 Daskalakis 等人的研究中(发表于《概率论及其相关领域》2010 年),基于 Mossel 的工作(《美国数学学会会刊》356(6):2379-2404, 2004),对于取代的简单 CFN 模型(即二态对称速率矩阵 Q 的情况),获得了一个紧密的序列长度要求。特别是,对于高概率重建所需的序列长度,在祖先重建问题的“临界”分支长度 g (ML)(Q)(如果存在)处发生了急剧转变(从 O(log n)到 poly(n),其中 n 是叶子的数量),该问题大致定义为:在 g (ML)(Q)以下,根处的序列可以从深树上的叶子处的序列准确估计,而在 g (ML)(Q)以上,信息沿树快速指数衰减。在这里,我们考虑一个更一般的进化模型,GTR 模型,其中 q×q 速率矩阵 Q 是可逆的,q≥2。对于这个模型,Roch 的最新结果(预印本,2009)表明,当分支长度低于 g (Lin)(Q)(称为 Kesten-Stigum(KS)界)时,可以使用长度为 O(log n)的序列准确重建树,直到这个界限,使用简单的线性估计器可以准确估计祖先序列。尽管对于 CFN 模型 g (ML)(Q)=g (Lin)(Q)(换句话说,线性祖先估计器在某种意义上是最佳的),但已知对于更一般的 GTR 模型,在许多情况下 g (ML)(Q)>g (Lin)(Q),并且严格不等式成立。在这里,我们通过展示一个对称模型 Q 的家族和一个从 O(log n)长度序列中恢复树的系统发育重建算法,表明这种现象也适用于系统发育重建,对于某些分支长度在(g (Lin)(Q),g (ML)(Q))范围内的情况。其次,我们证明了在 GTR 模型下进行系统发育重建需要多项式长度的序列,用于分支长度高于 g (ML)(Q)的情况。