When trees grow too long: investigating the causes of highly inaccurate Bayesian branch-length estimates.
Affiliation
Section of Integrative Biology and Center for Computational Biology and Bioinformatics, University of Texas at Austin, 1 University Station C0930, Austin, TX 78712, USA.
Publication information
Syst Biol. 2010 Mar;59(2):145-61. doi: 10.1093/sysbio/syp081. Epub 2009 Dec 10.
A surprising number of recent Bayesian phylogenetic analyses contain branch-length estimates that are several orders of magnitude longer than corresponding maximum-likelihood estimates. The levels of divergence implied by such branch lengths are unreasonable for studies using biological data and are known to be false for studies using simulated data. We conducted additional Bayesian analyses and studied approximate-posterior surfaces to investigate the causes underlying these large errors. We manipulated the starting parameter values of the Markov chain Monte Carlo (MCMC) analyses, the moves used by the MCMC analyses, and the prior-probability distribution on branch lengths. We demonstrate that inaccurate branch-length estimates result from either 1) poor mixing of MCMC chains or 2) posterior distributions with excessive weight at long tree lengths. Both effects are caused by a rapid increase in the volume of branch-length space as branches become longer. In the former case, both an MCMC move that scales all branch lengths in the tree simultaneously and the use of overdispersed starting branch lengths allow the chain to accurately sample the posterior distribution and should be used in Bayesian analyses of phylogeny. In the latter case, branch-length priors can have strong effects on resulting inferences and should be carefully chosen to reflect biological expectations. We provide a formula to calculate an exponential rate parameter for the branch-length prior that should eliminate inference of biased branch lengths in many cases. In any phylogenetic analysis, the biological plausibility of branch-length output must be carefully considered.
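The MCMC move that scales all branch lengths simultaneously can be illustrated with a minimal sketch. The code below is not taken from the paper or from any particular phylogenetics package; it assumes a standard multiplier proposal, and the function name `scale_tree_proposal` and the default `tuning` value are hypothetical choices made only for illustration. Multiplying every branch by one factor lets the chain change total tree length in a single step, rather than shrinking branches one at a time.

```python
import math
import random


def scale_tree_proposal(branch_lengths, tuning=2.0 * math.log(1.5), rng=random):
    """Propose new branch lengths by multiplying all of them by one factor.

    Assumed multiplier proposal (illustrative, not the paper's code):
    draw u ~ Uniform(0, 1) and set m = exp(tuning * (u - 0.5)).
    Returns (proposed_lengths, log_hastings_ratio).
    """
    u = rng.random()
    log_m = tuning * (u - 0.5)          # log of the common multiplier
    m = math.exp(log_m)
    proposed = [b * m for b in branch_lengths]
    # Scaling k parameters by the same factor m gives a proposal
    # (Hastings) ratio of m**k, i.e. k * log(m) on the log scale.
    log_hastings = len(branch_lengths) * log_m
    return proposed, log_hastings


if __name__ == "__main__":
    # Hypothetical starting lengths, deliberately far too long.
    start = [5.0, 8.0, 3.0, 6.0, 4.0]
    new, log_h = scale_tree_proposal(start, rng=random.Random(42))
    print(new, log_h)
```

In a full sampler, the log Hastings ratio returned here would be added to the log-likelihood and log-prior differences before the accept/reject step. The exponent in the Hastings ratio grows with the number of branches, which reflects the rapid growth in the volume of branch-length space described in the abstract.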