具有长分支的大系统发育推断：多长算太长？

Weizmann Institute, Rehovot, Israel.

Bull Math Biol. 2011 Jul;73(7):1627-44. doi: 10.1007/s11538-010-9584-6. Epub 2010 Oct 8.

The accurate reconstruction of phylogenies from short molecular sequences is an important problem in computational biology. Recent work has highlighted deep connections between sequence-length requirements for high-probability phylogeny reconstruction and the related problem of the estimation of ancestral sequences. In Daskalakis et al. (in Probab. Theory Relat. Fields 2010), building on the work of Mossel (Trans. Am. Math. Soc. 356(6):2379-2404, 2004), a tight sequence-length requirement was obtained for the simple CFN model of substitution, that is, the case of a two-state symmetric rate matrix Q. In particular the required sequence length for high-probability reconstruction was shown to undergo a sharp transition (from O(log n) to poly(n), where n is the number of leaves) at the "critical" branch length g (ML)(Q) (if it exists) of the ancestral reconstruction problem defined roughly as follows: below g (ML)(Q) the sequence at the root can be accurately estimated from sequences at the leaves on deep trees, whereas above g (ML)(Q) information decays exponentially quickly down the tree.Here, we consider a more general evolutionary model, the GTR model, where the q×q rate matrix Q is reversible with q≥2. For this model, recent results of Roch (Preprint, 2009) show that the tree can be accurately reconstructed with sequences of length O(log (n)) when the branch lengths are below g (Lin)(Q), known as the Kesten-Stigum (KS) bound, up to which ancestral sequences can be accurately estimated using simple linear estimators. Although for the CFN model g (ML)(Q)=g (Lin)(Q) (in other words, linear ancestral estimators are in some sense best possible), it is known that for the more general GTR models one has g (ML)(Q)≥g (Lin)(Q) with a strict inequality in many cases. Here, we show that this phenomenon also holds for phylogenetic reconstruction by exhibiting a family of symmetric models Q and a phylogenetic reconstruction algorithm which recovers the tree from O(log n)-length sequences for some branch lengths in the range (g (Lin)(Q),g (ML)(Q)). Second, we prove that phylogenetic reconstruction under GTR models requires a polynomial sequence-length for branch lengths above g (ML)(Q).

从短分子序列准确重建系统发育是计算生物学中的一个重要问题。最近的工作强调了序列长度要求对于高概率系统发育重建和相关的祖先序列估计问题之间的深刻联系。在 Daskalakis 等人的研究中（发表于《概率论及其相关领域》2010 年），基于 Mossel 的工作（《美国数学学会会刊》356(6):2379-2404, 2004），对于取代的简单 CFN 模型（即二态对称速率矩阵 Q 的情况），获得了一个紧密的序列长度要求。特别是，对于高概率重建所需的序列长度，在祖先重建问题的“临界”分支长度 g (ML)(Q)（如果存在）处发生了急剧转变（从 O(log n)到 poly(n)，其中 n 是叶子的数量），该问题大致定义为：在 g (ML)(Q)以下，根处的序列可以从深树上的叶子处的序列准确估计，而在 g (ML)(Q)以上，信息沿树快速指数衰减。在这里，我们考虑一个更一般的进化模型，GTR 模型，其中 q×q 速率矩阵 Q 是可逆的，q≥2。对于这个模型，Roch 的最新结果（预印本，2009）表明，当分支长度低于 g (Lin)(Q)（称为 Kesten-Stigum（KS）界）时，可以使用长度为 O(log n)的序列准确重建树，直到这个界限，使用简单的线性估计器可以准确估计祖先序列。尽管对于 CFN 模型 g (ML)(Q)=g (Lin)(Q)（换句话说，线性祖先估计器在某种意义上是最佳的），但已知对于更一般的 GTR 模型，在许多情况下 g (ML)(Q)>g (Lin)(Q)，并且严格不等式成立。在这里，我们通过展示一个对称模型 Q 的家族和一个从 O(log n)长度序列中恢复树的系统发育重建算法，表明这种现象也适用于系统发育重建，对于某些分支长度在（g (Lin)(Q),g (ML)(Q)）范围内的情况。其次，我们证明了在 GTR 模型下进行系统发育重建需要多项式长度的序列，用于分支长度高于 g (ML)(Q)的情况。

相似文献

On the inference of large phylogenies with long branches: how long is too long?

Bull Math Biol. 2011 Jul;73(7):1627-44. doi: 10.1007/s11538-010-9584-6. Epub 2010 Oct 8.

Sequence length bounds for resolving a deep phylogenetic divergence.

J Theor Biol. 2009 Jan 21;256(2):247-52. doi: 10.1016/j.jtbi.2008.09.031. Epub 2008 Oct 11.

Evaluating the relationship between evolutionary divergence and phylogenetic accuracy in AFLP data sets.

Mol Biol Evol. 2010 May;27(5):988-1000. doi: 10.1093/molbev/msp315. Epub 2009 Dec 21.

The devil in the details: interactions between the branch-length prior and likelihood model affect node support and branch lengths in the phylogeny of the Psoraceae.

Syst Biol. 2011 Jul;60(4):541-61. doi: 10.1093/sysbio/syr022. Epub 2011 Mar 24.

The impact and interplay of long and short branches on phylogenetic information content.

J Theor Biol. 2012 Dec 7;314:157-63. doi: 10.1016/j.jtbi.2012.08.040. Epub 2012 Sep 7.

Modelling heterotachy in phylogenetic inference by reversible-jump Markov chain Monte Carlo.

Philos Trans R Soc Lond B Biol Sci. 2008 Dec 27;363(1512):3955-64. doi: 10.1098/rstb.2008.0178.

A signal-to-noise analysis of phylogeny estimation by neighbor-joining: Insufficiency of polynomial length sequences.

Math Biosci. 2006 Feb;199(2):188-215. doi: 10.1016/j.mbs.2005.11.003. Epub 2006 Jan 18.

Is homoplasy or lineage sorting the source of incongruent mtdna and nuclear gene trees in the stiff-tailed ducks (Nomonyx-Oxyura)?

Syst Biol. 2005 Feb;54(1):35-55. doi: 10.1080/10635150590910249.

Fast computation of distance estimators.

BMC Bioinformatics. 2007 Mar 13;8:89. doi: 10.1186/1471-2105-8-89.

Fast recovery of evolutionary trees with thousands of nodes.

J Comput Biol. 2002;9(2):277-97. doi: 10.1089/10665270252935467.

引用本文的文献

Ages, sizes and (trees within) trees of taxa and of urns, from Yule to today.

Philos Trans R Soc Lond B Biol Sci. 2025 Feb 13;380(1919):20230305. doi: 10.1098/rstb.2023.0305. Epub 2025 Feb 20.

Bayesian Inference of Phylogenetic Distances: Revisiting the Eigenvalue Approach.

Bull Math Biol. 2025 Jan 23;87(2):32. doi: 10.1007/s11538-024-01403-z.

Predicting Long Pendant Edges in Model Phylogenies, with Applications to Biodiversity and Tree Inference.

Syst Biol. 2023 Jun 17;72(3):575-589. doi: 10.1093/sysbio/syac059.

Suppr 超能文献

核心技术专利：CN118964589B侵权必究

相似文献

On the inference of large phylogenies with long branches: how long is too long?

Bull Math Biol. 2011 Jul;73(7):1627-44. doi: 10.1007/s11538-010-9584-6. Epub 2010 Oct 8.

Sequence length bounds for resolving a deep phylogenetic divergence.

J Theor Biol. 2009 Jan 21;256(2):247-52. doi: 10.1016/j.jtbi.2008.09.031. Epub 2008 Oct 11.

Evaluating the relationship between evolutionary divergence and phylogenetic accuracy in AFLP data sets.

Mol Biol Evol. 2010 May;27(5):988-1000. doi: 10.1093/molbev/msp315. Epub 2009 Dec 21.

The devil in the details: interactions between the branch-length prior and likelihood model affect node support and branch lengths in the phylogeny of the Psoraceae.

Syst Biol. 2011 Jul;60(4):541-61. doi: 10.1093/sysbio/syr022. Epub 2011 Mar 24.

The impact and interplay of long and short branches on phylogenetic information content.

J Theor Biol. 2012 Dec 7;314:157-63. doi: 10.1016/j.jtbi.2012.08.040. Epub 2012 Sep 7.

Modelling heterotachy in phylogenetic inference by reversible-jump Markov chain Monte Carlo.

Philos Trans R Soc Lond B Biol Sci. 2008 Dec 27;363(1512):3955-64. doi: 10.1098/rstb.2008.0178.

A signal-to-noise analysis of phylogeny estimation by neighbor-joining: Insufficiency of polynomial length sequences.

Math Biosci. 2006 Feb;199(2):188-215. doi: 10.1016/j.mbs.2005.11.003. Epub 2006 Jan 18.

Is homoplasy or lineage sorting the source of incongruent mtdna and nuclear gene trees in the stiff-tailed ducks (Nomonyx-Oxyura)?

Syst Biol. 2005 Feb;54(1):35-55. doi: 10.1080/10635150590910249.

Fast computation of distance estimators.

BMC Bioinformatics. 2007 Mar 13;8:89. doi: 10.1186/1471-2105-8-89.

Fast recovery of evolutionary trees with thousands of nodes.

J Comput Biol. 2002;9(2):277-97. doi: 10.1089/10665270252935467.

引用本文的文献

Ages, sizes and (trees within) trees of taxa and of urns, from Yule to today.

Philos Trans R Soc Lond B Biol Sci. 2025 Feb 13;380(1919):20230305. doi: 10.1098/rstb.2023.0305. Epub 2025 Feb 20.

Bayesian Inference of Phylogenetic Distances: Revisiting the Eigenvalue Approach.

Bull Math Biol. 2025 Jan 23;87(2):32. doi: 10.1007/s11538-024-01403-z.

Predicting Long Pendant Edges in Model Phylogenies, with Applications to Biodiversity and Tree Inference.

Syst Biol. 2023 Jun 17;72(3):575-589. doi: 10.1093/sysbio/syac059.

On the inference of large phylogenies with long branches: how long is too long?

机构信息

出版信息

相似文献

引用本文的文献

文献AI研究员

用中文搜PubMed

文档翻译

Suppr 超能文献

相似文献

引用本文的文献