Fernandes Andrew D, Atchley William R
Department of Biochemistry, The University of Western Ontario, London, Ontario N6A5C1, Canada.
Bioinformatics. 2008 Oct 1;24(19):2177-83. doi: 10.1093/bioinformatics/btn395. Epub 2008 Jul 28.
In a nucleotide or amino acid sequence, not all sites evolve at the same rate, due to differing selective constraints at each site. Currently in computational molecular evolution, models incorporating rate heterogeneity always share two assumptions. First, the rate of evolution at each site is assumed to be independent of every other site. Second, the values of these rates are assumed to be drawn from a known prior distribution. Although often assumed to be small, the actual effect of these assumptions has not been previously quantified in the literature.
Herein we describe an algorithm to simultaneously infer the set of n-1 relative rates that parameterize the likelihood of an n-site alignment. Unlike previous work, (a) these relative rates are completely identifiable and distinct from the branch-length parameters, and (b) a far more general class of rate priors can be used and their effects quantified. Although the algorithm is described in a Bayesian framework, we also discuss a future maximum likelihood extension.
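The identifiability point can be illustrated with a minimal sketch. It is not the algorithm described above: it assumes a two-sequence alignment under the Jukes-Cantor (JC69) model, a single branch length `t`, and a mean-one normalization of the rates, all chosen here purely for illustration. The function names and the toy data are hypothetical. Under these assumptions the likelihood depends only on the product of rate and branch length, so absolute rates cannot be separated from `t`, whereas rates constrained to a fixed mean leave n-1 free parameters and a well-defined branch length.

```python
import math


def jc69_site_loglik(same: bool, rate: float, t: float) -> float:
    """Log-likelihood of one site for two sequences under JC69 (illustrative)."""
    decay = math.exp(-4.0 * rate * t / 3.0)
    if same:
        # Stationary frequency 1/4 times the probability of no net change.
        p = 0.25 * (0.25 + 0.75 * decay)
    else:
        # Stationary frequency 1/4 times the probability of one specific change.
        p = 0.25 * (0.25 - 0.25 * decay)
    return math.log(p)


def alignment_loglik(pattern, rates, t):
    """Sum of per-site log-likelihoods, one rate per site."""
    return sum(jc69_site_loglik(s, r, t) for s, r in zip(pattern, rates))


def to_relative(rates, t):
    """Rescale rates to mean 1 and fold the scale into the branch length.

    The mean-one constraint is an assumption of this sketch.
    """
    mean = sum(rates) / len(rates)
    return [r / mean for r in rates], t * mean


# Hypothetical toy data: True means the two sequences agree at that site.
pattern = [True, False, True, True, False]
rates = [0.2, 3.0, 0.5, 0.1, 1.7]
t = 0.3

# Scaling every rate by c and the branch length by 1/c changes nothing.
c = 7.0
ll_original = alignment_loglik(pattern, rates, t)
ll_rescaled = alignment_loglik(pattern, [c * r for r in rates], t / c)

# Normalized rates carry n-1 free parameters; the scale lives in t_rel.
rel_rates, t_rel = to_relative(rates, t)
ll_relative = alignment_loglik(pattern, rel_rates, t_rel)
```

In this sketch `ll_original`, `ll_rescaled` and `ll_relative` agree up to floating-point rounding, which is the sense in which only relative rates, and not absolute ones, are determined by the data once branch lengths are also free.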
Using both synthetic data and alignments from the Myc, Max and p53 protein families, we find that inferring relative rather than absolute rates has several advantages. First, both empirical likelihoods and Bayes factors show strong preference for the relative-rate model, with a mean Δ ln P = -0.458 per alignment site. Second, the computed likelihoods and Bayes factors were essentially independent of the relative-rate prior, indicating that good estimates of the posterior rate distribution are not required a priori. Third, a novel finding is that rates can be accurately inferred even when up to approximately 4 substitutions per site have occurred. Thus biologically relevant putative hypervariable sites can be identified as easily as conserved sites. Lastly, our model treats rates and tree branch-lengths as completely identifiable, allowing for the first time coherent simultaneous inference of branch-lengths and site-specific evolutionary rates.
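To give a sense of scale for the per-site figure, a mean difference accumulates linearly with alignment length. The alignment length n = 100 below is a hypothetical value chosen for illustration and is not taken from the data sets above; the abstract also does not spell out which model is subtracted from which, so only the magnitude should be read from this example.

```latex
\Delta \ln P_{\text{total}}
  = n \cdot \overline{\Delta \ln P}
  = 100 \times (-0.458)
  = -45.8
```

On a natural-log scale, a difference of magnitude 45.8 corresponds to a probability ratio of roughly 20 orders of magnitude, which is consistent with the description of the preference as strong.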
Source code for the utility described is available under a BSD-style license at http://www.fernandes.org/txp/article/9/site-specific-relative-evolutionary-rates.