Whelan Simon
Faculty of Life Sciences, University of Manchester, Michael Smith Building, Manchester M13 9PT, United Kingdom.
Mol Biol Evol. 2008 Aug;25(8):1683-94. doi: 10.1093/molbev/msn119. Epub 2008 May 22.
Models of nucleotide substitution make many simplifying assumptions about the evolutionary process, including that the same process acts on all sites in an alignment and on all branches on the phylogenetic tree. Many studies have shown that in reality the substitution process is heterogeneous and that this variability can introduce systematic errors into many forms of phylogenetic analyses. I propose a new rigorous approach for describing heterogeneity called a temporal hidden Markov model (THMM), which can distinguish between among-site (spatial) heterogeneity and among-lineage (temporal) heterogeneity. Several versions of the THMM are applied to 16 sets of aligned sequences to quantitatively assess the different forms of heterogeneity acting within them. The most general THMM provides the best fit in all the data sets examined, providing strong evidence of pervasive heterogeneity during evolution. Investigating individual forms of heterogeneity provides further insights. In agreement with previous studies, spatial rate heterogeneity (rates across sites [RAS]) is inferred to be the single most prevalent form of heterogeneity. Interestingly, RAS appears so dominant that failure to independently include it in the THMM masks other forms of heterogeneity, particularly temporal heterogeneity. Incorporating RAS into the THMM reveals substantial temporal and spatial heterogeneity in nucleotide composition and bias toward transition substitution in all alignments examined, although the relative importance of different forms of heterogeneity varies between data sets. Furthermore, the improvements in model fit observed by adding complexity to the model suggest that the THMMs used in this study do not capture all the evolutionary heterogeneity occurring in the data. These observations all indicate that current tests may consistently underestimate the degree of temporal heterogeneity occurring in data. Finally, there is a weak link between the amount of heterogeneity detected and the level of divergence between the sequences, suggesting that variability in the evolutionary process will be a particular problem for deep phylogeny.