Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA.
Department of Biology, Temple University, Philadelphia, PA.
Mol Biol Evol. 2020 Jun 1;37(6):1819-1831. doi: 10.1093/molbev/msaa049.
The conventional wisdom in molecular evolution is to apply parameter-rich models of nucleotide and amino acid substitutions for estimating divergence times. However, the actual extent of the difference between time estimates produced by highly complex models compared with those from simple models is yet to be quantified for contemporary data sets that frequently contain sequences from many species and genes. In a reanalysis of many large multispecies alignments from diverse groups of taxa, we found that the use of the simplest models can produce divergence time estimates and credibility intervals similar to those obtained from the complex models applied in the original studies. This result is surprising because the use of simple models underestimates sequence divergence for all the data sets analyzed. We found three fundamental reasons for the observed robustness of time estimates to model complexity in many practical data sets. First, the estimates of branch lengths and node-to-tip distances under the simplest model show an approximately linear relationship with those produced by using the most complex models applied on data sets with many sequences. Second, relaxed clock methods automatically adjust rates on branches that experience considerable underestimation of sequence divergences, resulting in time estimates that are similar to those from complex models. And, third, the inclusion of even a few good calibrations in an analysis can reduce the difference in time estimates from simple and complex models. The robustness of time estimates to model complexity in these empirical data analyses is encouraging, because all phylogenomics studies use statistical models that are oversimplified descriptions of actual evolutionary substitution processes.
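The first reason given in the abstract, that distances under the simplest model are underestimates yet remain approximately linearly related to those from a richer model, can be illustrated with a toy calculation. The sketch below is not the analysis from the study. It assumes the Jukes-Cantor distance as the "simple" model and the same distance with gamma-distributed rates among sites as the "complex" one, and it uses a made-up shape parameter (`ALPHA = 0.5`) and a made-up range of observed proportions of differing sites. The function and variable names are hypothetical.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor distance from the proportion p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc69_gamma_distance(p, alpha):
    """Jukes-Cantor distance with gamma-distributed rates (shape alpha)."""
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Illustrative assumptions, not values taken from the study.
ALPHA = 0.5
p_values = [i / 100 for i in range(1, 31)]  # observed p from 0.01 to 0.30

simple = [jc69_distance(p) for p in p_values]
rich = [jc69_gamma_distance(p, ALPHA) for p in p_values]
r = pearson_r(simple, rich)

# Print every fifth pair to show the growing underestimation.
for p, s, c in list(zip(p_values, simple, rich))[::5]:
    print(f"p={p:.2f}  simple={s:.4f}  rich={c:.4f}")
print(f"Pearson r between simple and rich distances: {r:.3f}")
```

In this toy setting the simple distance is smaller than the gamma-corrected one at every value of `p`, and the gap widens with divergence, but the two stay strongly correlated over the range shown. Real alignments, tree-based branch lengths, and relaxed-clock rate adjustment are all beyond what this sketch models.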