Yang Ziheng, Rannala Bruce
Department of Biology, University College London, London, United Kingdom.
Mol Biol Evol. 2006 Jan;23(1):212-26. doi: 10.1093/molbev/msj024. Epub 2005 Sep 21.
We implement a Bayesian Markov chain Monte Carlo algorithm for estimating species divergence times that uses heterogeneous data from multiple gene loci and accommodates multiple fossil calibration nodes. A birth-death process with species sampling is used to specify a prior for divergence times, which allows easy assessment of the effects of that prior on posterior time estimates. We propose a new approach for specifying calibration points on the phylogeny, which allows the use of arbitrary and flexible statistical distributions to describe uncertainties in fossil dates. In particular, we use soft bounds, so that the probability that the true divergence time is outside the bounds is small but nonzero. A strict molecular clock is assumed in the current implementation, although this assumption may be relaxed. We apply our new algorithm to two data sets concerning divergences of several primate species, to examine the effects of the substitution model and of the prior for divergence times on Bayesian time estimation. We also conduct computer simulations to examine the differences between soft and hard bounds. We demonstrate that divergence time estimation is intrinsically hampered by uncertainties in fossil calibrations, and that the error in Bayesian time estimates will not go to zero with increased amounts of sequence data. Our analyses of both real and simulated data demonstrate potentially large differences between divergence time estimates obtained using soft versus hard bounds and a general superiority of soft bounds. Our main findings are as follows.
(1) When the fossils are consistent with each other and with the molecular data, and the posterior time estimates are well within the prior bounds, soft and hard bounds produce similar results.
(2) When the fossils are in conflict with each other or with the molecules, soft and hard bounds behave very differently; soft bounds allow sequence data to correct poor calibrations, while poor hard bounds are impossible to overcome by any amount of data.
(3) Soft bounds eliminate the need for "safe" but unrealistically high upper bounds, which may bias posterior time estimates.
(4) Soft bounds allow more reliable assessment of estimation errors, while hard bounds generate misleadingly high precision when fossils and molecules are in conflict.
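The contrast between soft and hard bounds can be illustrated with a small numerical sketch. It is not the algorithm described above. The piecewise prior used here (power-law lower tail, flat middle, exponential upper tail, with 2.5% of the probability in each tail) is one way to build a soft bound and is an assumption, since the abstract does not give the functional form. The Gaussian likelihood is a hypothetical stand-in for sequence data, with a smaller standard deviation playing the role of more data. The bounds (6 to 8) and the data mean (10) are made-up values chosen so that the calibration conflicts with the data.

```python
import math

def soft_bound_density(t, t_low, t_high, tail=0.025):
    """Prior density for a node age with soft lower and upper bounds.

    Assumed form: probability `tail` lies below t_low and above t_high,
    and the density is continuous at both bounds.
    """
    if t <= 0.0:
        return 0.0
    height = (1.0 - 2.0 * tail) / (t_high - t_low)  # flat part between the bounds
    theta_low = height * t_low / tail    # matches the flat part at t_low
    theta_high = height / tail           # matches the flat part at t_high
    if t < t_low:
        return tail * theta_low / t_low * (t / t_low) ** (theta_low - 1.0)
    if t <= t_high:
        return height
    return tail * theta_high * math.exp(-theta_high * (t - t_high))

def hard_bound_density(t, t_low, t_high):
    """Uniform prior with zero density outside [t_low, t_high]."""
    return 1.0 / (t_high - t_low) if t_low <= t <= t_high else 0.0

def posterior_mean(prior, data_mean, data_sd, grid_max=20.0, steps=20000):
    """Grid approximation of E[t | data] under a Gaussian stand-in likelihood."""
    width = grid_max / steps
    num = den = 0.0
    for i in range(steps):
        t = (i + 0.5) * width  # midpoint of the i-th grid cell
        weight = prior(t) * math.exp(-0.5 * ((t - data_mean) / data_sd) ** 2)
        num += t * weight
        den += weight
    return num / den

# Hypothetical calibration: bounds of 6 and 8 time units.
T_LOW, T_HIGH = 6.0, 8.0
soft = lambda t: soft_bound_density(t, T_LOW, T_HIGH)
hard = lambda t: hard_bound_density(t, T_LOW, T_HIGH)

# Data favour an age of 10, outside the bounds; smaller sd mimics more data.
for sd in (1.0, 0.3, 0.1):
    print(sd, round(posterior_mean(hard, 10.0, sd), 2),
          round(posterior_mean(soft, 10.0, sd), 2))
```

In this sketch the hard-bound posterior mean can never exceed the upper bound, however sharp the likelihood becomes, whereas the soft-bound posterior mean moves beyond the bound toward the value favoured by the data as the likelihood sharpens. This mirrors finding (2) only qualitatively.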