Zhu Tianqi, Dos Reis Mario, Yang Ziheng
Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China; Department of Genetics, Evolution and Environment, University College London, Darwin Building, Gower Street, London WC1E 6BT, UK.
Syst Biol. 2015 Mar;64(2):267-80. doi: 10.1093/sysbio/syu109. Epub 2014 Dec 11.
Genetic sequence data provide information about the distances between species or branch lengths in a phylogeny, but not directly about the absolute divergence times or the evolutionary rates. Bayesian methods for dating species divergences estimate times and rates by assigning priors on them. In particular, the prior on times (node ages on the phylogeny) incorporates information in the fossil record to calibrate the molecular tree. Because times and rates are confounded, our posterior time estimates will not approach point values even if an infinite amount of sequence data is used in the analysis. In a previous study we developed a finite-sites theory to characterize the uncertainty in Bayesian divergence time estimation in analysis of large but finite sequence data sets under a strict molecular clock. As most modern clock dating analyses use more than one locus and are conducted under relaxed clock models, here we extend the theory to the case of relaxed clock analysis of data from multiple loci (site partitions). Uncertainty in posterior time estimates is partitioned into three sources: sampling errors in the estimates of branch lengths in the tree for each locus due to limited sequence length, variation of substitution rates among lineages and among loci, and uncertainty in fossil calibrations. Using a simple but analogous estimation problem involving the multivariate normal distribution, we predict that as the number of loci (L) goes to infinity, the variance in posterior time estimates decreases and approaches the infinite-data limit at the rate of 1/L, and the limit is independent of the number of sites in the sequence alignment. We then confirmed the predictions by using computer simulation on phylogenies of two or three species, and by analyzing a real genomic data set for six primate species.
Our results suggest that with the fossil calibrations fixed, analyzing multiple loci or site partitions is the most effective way to improve the precision of posterior time estimation. However, even if a huge amount of sequence data is analyzed, considerable uncertainty will persist in time estimates.
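The 1/L behaviour described in the abstract can be illustrated with a toy normal model. This is only a sketch under assumptions made here, not the model used in the paper. On the log scale, each locus gives an observed log branch length equal to log time plus log mean rate plus noise. The log time has a normal prior with variance c2 (standing in for the fossil calibration), the log mean rate has a normal prior with variance s2, and the per-locus noise has variance sigma2 (rate variation among loci) plus tau2/n_sites (sampling error from finite sequence length). All function names, parameter names, and default values are hypothetical.

```python
def posterior_var_log_time(L, c2=0.04, s2=0.09, sigma2=0.25,
                           tau2=1.0, n_sites=1000):
    """Posterior variance of log time in the toy normal model with L loci."""
    # Per-locus noise: among-locus rate variation plus finite-sites sampling error.
    w = sigma2 + tau2 / n_sites
    # The mean of L observations has variance c2 + s2 + w / L and
    # covariance c2 with log time; apply normal conditioning.
    return c2 - c2 ** 2 / (c2 + s2 + w / L)


def infinite_loci_limit(c2=0.04, s2=0.09):
    """Limit of the posterior variance as L goes to infinity."""
    # Only the two prior variances remain; sigma2, tau2 and n_sites drop out.
    return c2 * s2 / (c2 + s2)


if __name__ == "__main__":
    limit = infinite_loci_limit()
    for L in (1, 10, 100, 1000, 10000):
        excess = posterior_var_log_time(L) - limit
        # excess * L settles to a constant, i.e. the excess shrinks like 1/L.
        print(L, posterior_var_log_time(L), excess * L)
```

In this toy model the number of sites changes only how fast the limit is approached, not the limit itself, and the remaining variance comes entirely from the priors on time and rate. This mirrors the abstract's statement that considerable uncertainty persists even with a huge amount of sequence data.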