Kaehler Benjamin D, Yap Von Bing, Zhang Rongli, Huttley Gavin A
John Curtin School of Medical Research, Australian National University, Canberra, ACT, 2600, Australia.
Department of Statistics and Applied Probability, National University of Singapore, Singapore, 117546, Singapore.
Syst Biol. 2015 Mar;64(2):281-93. doi: 10.1093/sysbio/syu106. Epub 2014 Dec 9.
The genetic distance between biological sequences is a fundamental quantity in molecular evolution. It pertains to questions of rates of evolution, existence of a molecular clock, and phylogenetic inference. Under the class of continuous-time substitution models, the distance is commonly defined as the expected number of substitutions at any site in the sequence. We eschew the almost ubiquitous assumptions of evolution under stationarity and time-reversible conditions and extend the concept of the expected number of substitutions to nonstationary Markov models where the only remaining constraint is of time homogeneity between nodes in the tree. Our measure of genetic distance reduces to the standard formulation if the data in question are consistent with the stationarity assumption. We apply this general model to samples from across the tree of life to compare distances so obtained with those from the general time-reversible model, with and without rate heterogeneity across sites, and the paralinear distance, an empirical pairwise method explicitly designed to address nonstationarity. We discover that estimates from both variants of the general time-reversible model and the paralinear distance systematically overestimate genetic distance and departure from the molecular clock. The magnitude of the distance bias is proportional to departure from stationarity, which we demonstrate to be associated with longer edge lengths. The marked improvement in consistency between the general nonstationary Markov model and sequence alignments leads us to conclude that analyses of evolutionary rates and phylogenies will be substantively improved by application of this model.
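To make the central quantity concrete, here is a minimal sketch of the expected number of substitutions per site for a time-homogeneous Markov process that does not start at its stationary distribution. It relies on the standard continuous-time Markov chain result that the expected substitution rate at time s is -sum_i pi_i(s) q_ii, where pi(s) solves pi'(s) = pi(s) Q with pi(0) = pi0. Everything specific here is an assumption for illustration and is not taken from the paper: the function names `f81_rate_matrix` and `expected_substitutions`, the F81-style rate matrix, the example frequencies, and the use of a fixed-step Runge-Kutta integrator. It is not the authors' estimation procedure.

```python
def f81_rate_matrix(pi):
    """Illustrative F81-style rate matrix: q_ij = pi_j for i != j.

    Each diagonal entry is pi_i - 1, so every row sums to zero and
    pi is the stationary distribution of the process.
    """
    n = len(pi)
    return [[pi[j] if i != j else pi[i] - 1.0 for j in range(n)]
            for i in range(n)]


def _derivative(state, Q):
    """Right-hand side of the joint ODE for (pi(s), d(s)).

    pi'(s) = pi(s) Q             (state frequencies evolve along the edge)
    d'(s)  = -sum_i pi_i(s) q_ii (instantaneous expected substitution rate)
    """
    n = len(Q)
    pi = state[:n]
    dpi = [sum(pi[i] * Q[i][j] for i in range(n)) for j in range(n)]
    dd = -sum(pi[i] * Q[i][i] for i in range(n))
    return dpi + [dd]


def expected_substitutions(pi0, Q, t, steps=1000):
    """Expected substitutions per site over an edge of duration t.

    Integrates the joint ODE with the classical fourth-order
    Runge-Kutta scheme. Returns (distance, pi_t), where pi_t is the
    state distribution at the end of the edge.
    """
    h = t / steps
    y = list(pi0) + [0.0]  # last entry accumulates the distance
    for _ in range(steps):
        k1 = _derivative(y, Q)
        k2 = _derivative([a + 0.5 * h * b for a, b in zip(y, k1)], Q)
        k3 = _derivative([a + 0.5 * h * b for a, b in zip(y, k2)], Q)
        k4 = _derivative([a + h * b for a, b in zip(y, k3)], Q)
        y = [a + h / 6.0 * (b + 2.0 * c + 2.0 * d + e)
             for a, b, c, d, e in zip(y, k1, k2, k3, k4)]
    return y[-1], y[:-1]


if __name__ == "__main__":
    pi = [0.1, 0.2, 0.3, 0.4]          # hypothetical stationary frequencies
    Q = f81_rate_matrix(pi)

    # Stationary start: matches the standard formula -t * sum_i pi_i q_ii.
    d_stationary, _ = expected_substitutions(pi, Q, 1.0)

    # Nonstationary start: same Q and t, different initial frequencies.
    d_nonstationary, _ = expected_substitutions([0.25] * 4, Q, 1.0)

    print(d_stationary, d_nonstationary)
```

When the process starts at the stationary distribution, pi(s) stays constant and the integral collapses to the familiar -t * sum_i pi_i q_ii (about 0.7 for these frequencies with t = 1), which is the sense in which the nonstationary measure reduces to the standard formulation. With a uniform starting distribution the same Q and t give a slightly different value, because the composition drifts toward pi along the edge.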