Department of Genetics, Evolution and Environment, University College London, Darwin Building, Gower Street, London WC1E 6BT, UK;
Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.
Syst Biol. 2014 Jul;63(4):555-65. doi: 10.1093/sysbio/syu020. Epub 2014 Mar 21.
Bayesian methods provide a powerful way to estimate species divergence times by combining information from molecular sequences with information from the fossil record. With the explosive increase of genomic data, divergence time estimation increasingly uses data of multiple loci (genes or site partitions). Widely used computer programs to estimate divergence times use independent and identically distributed (i.i.d.) priors on the substitution rates for different loci. The i.i.d. prior is problematic. As the number of loci (L) increases, the prior variance of the average rate across all loci goes to zero at the rate 1/L. As a consequence, the rate prior dominates posterior time estimates when many loci are analyzed, and if the rate prior is misspecified, the estimated divergence times will converge to wrong values with very narrow credibility intervals. Here we develop a new prior on the locus rates based on the Dirichlet distribution that corrects the problematic behavior of the i.i.d. prior. We use computer simulation and real data analysis to highlight the differences between the old and new priors. For a dataset for six primate species, we show that with the old i.i.d. prior, if the prior rate is too high (or too low), the estimated divergence times are too young (or too old), outside the bounds imposed by the fossil calibrations. In contrast, with the new Dirichlet prior, posterior time estimates are insensitive to the rate prior and are compatible with the fossil calibrations. We re-analyzed a phylogenomic data set of 36 mammal species and show that using many fossil calibrations can alleviate the adverse impact of a misspecified rate prior to some extent. We recommend the use of the new Dirichlet prior in Bayesian divergence time estimation. [Bayesian inference, divergence time, relaxed clock, rate prior, partition analysis.].
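The 1/L behavior described above can be checked numerically. The following is a minimal sketch, not the authors' implementation. It assumes gamma-distributed locus rates for the i.i.d. prior. For the Dirichlet-based prior, it assumes one construction consistent with the abstract: a gamma prior is placed on the mean rate, and the total rate is split among loci using symmetric Dirichlet proportions. The function names, the gamma parameters (`shape`, `rate`), and the concentration `conc` are illustrative choices, not values taken from the paper.

```python
import numpy as np


def mean_rate_iid(L, shape, rate, n_draws, rng):
    """Prior draws of the average rate over L loci under an i.i.d. gamma prior.

    Each locus rate is Gamma(shape, rate) with mean shape/rate, so the
    variance of the average is (shape / rate**2) / L.
    """
    rates = rng.gamma(shape, 1.0 / rate, size=(n_draws, L))
    return rates.mean(axis=1)


def mean_rate_dirichlet(L, shape, rate, conc, n_draws, rng):
    """Prior draws of the average rate under an assumed gamma-Dirichlet prior.

    The mean rate mu is Gamma(shape, rate); the total rate L * mu is then
    partitioned among loci with symmetric Dirichlet(conc) proportions.
    The average locus rate equals mu by construction, whatever L is.
    """
    mu = rng.gamma(shape, 1.0 / rate, size=n_draws)
    props = rng.dirichlet(np.full(L, conc), size=n_draws)  # shape (n_draws, L)
    rates = mu[:, None] * L * props
    return rates.mean(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    for L in (1, 10, 100):
        v_iid = mean_rate_iid(L, 2.0, 2.0, 20000, rng).var()
        v_dir = mean_rate_dirichlet(L, 2.0, 2.0, 1.0, 20000, rng).var()
        print(f"L={L:4d}  var(mean rate): i.i.d.={v_iid:.4f}  Dirichlet={v_dir:.4f}")
```

Under these assumptions, the prior variance of the mean rate shrinks in proportion to 1/L for the i.i.d. prior, while it stays at the variance of the gamma prior on the mean rate for the Dirichlet-based prior. This is why adding loci does not make the second prior increasingly informative about the overall rate.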