Gowri-Shankar Vivek, Rattray Magnus
School of Computer Science, University of Manchester, Manchester M13 9PL, United Kingdom.
Mol Biol Evol. 2006 Feb;23(2):352-64. doi: 10.1093/molbev/msj040. Epub 2005 Oct 19.
Model-based phylogenetic reconstruction methods traditionally assume homogeneity of nucleotide frequencies among sequence sites and lineages. Yet, heterogeneity in base composition is a characteristic shared by most biological sequences. Compositional variation in time, reflected in the compositional biases among contemporary sequences, has already been extensively studied, and its detrimental effects on phylogenetic estimates are known. However, fewer studies have focused on the effects of spatial compositional heterogeneity within genes. We show here that different sites in an alignment do not always share a single compositional pattern, and we provide examples where nucleotide frequency trends are correlated with the site-specific rate of evolution in RNA genes. Spatial compositional heterogeneity is shown to affect the estimation of evolutionary parameters. With standard phylogenetic methods, estimates of equilibrium frequencies are found to be biased towards the composition observed at fast-evolving sites. Conversely, the ancestral composition estimates of some time-heterogeneous but spatially homogeneous methods are found to be biased towards frequencies observed at invariant and slow-evolving sites. The latter finding challenges the result of a previous study that argued against a hyperthermophilic last universal ancestor on the basis of the low apparent G + C content of its rRNA sequences. We propose a new model to account for compositional variation across sites. A Gaussian process prior is used to allow for a smooth change in composition with evolutionary rate. The model has been implemented in the phylogenetic inference software PHASE, and Bayesian methods can be used to estimate the model parameters. The results suggest that this model can accurately capture the observed trends in present-day RNA sequences.
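The idea of a Gaussian process prior that lets composition change smoothly with evolutionary rate can be illustrated with a minimal sketch. This is not the PHASE implementation: the squared-exponential kernel on log rate, the softmax link from latent functions to nucleotide frequencies, the hyperparameter values, and the names `rbf_kernel` and `sample_composition` are all assumptions chosen for illustration.

```python
import numpy as np


def rbf_kernel(x, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between all pairs of points in x (assumed kernel)."""
    d = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)


def sample_composition(rates, length_scale=1.0, variance=1.0, seed=0):
    """Draw one set of A, C, G, U frequencies per rate category from a GP prior.

    Each nucleotide gets its own latent function over log(rate); a softmax
    turns the four latent values at each rate into frequencies summing to 1.
    """
    rng = np.random.default_rng(seed)
    x = np.log(rates)
    # Small diagonal jitter keeps the Cholesky factorisation numerically stable.
    cov = rbf_kernel(x, length_scale, variance) + 1e-6 * np.eye(len(x))
    chol = np.linalg.cholesky(cov)
    # Four independent GP draws, one column per nucleotide.
    latent = chol @ rng.standard_normal((len(x), 4))
    # Numerically stable softmax across the four nucleotides.
    latent -= latent.max(axis=1, keepdims=True)
    weights = np.exp(latent)
    return weights / weights.sum(axis=1, keepdims=True)


# Hypothetical relative rates for six site categories, slow to fast.
rates = np.array([0.1, 0.3, 0.6, 1.0, 1.8, 3.2])
freqs = sample_composition(rates)
for r, f in zip(rates, freqs):
    print(f"rate {r:4.1f}: " + " ".join(f"{p:.3f}" for p in f))
```

Because neighbouring rate categories are strongly correlated under the kernel, the sampled frequencies drift gradually from slow to fast sites rather than jumping independently. A larger `length_scale` gives a flatter trend, and a very large one approaches the usual site-homogeneous assumption.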