Blouin C, Butt D, Roger A J
Genome Atlantic, Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada.
Mol Biol Evol. 2005 Mar;22(3):784-91. doi: 10.1093/molbev/msi065. Epub 2004 Dec 8.
The function of individual sites within a protein influences their rate of accepted point mutation. During the computation of phylogenetic likelihoods, rate heterogeneity can be modeled on a site-by-site basis with relative rates drawn from a discretized gamma distribution. Site-rate estimates (e.g., the rate with the highest posterior probability given the data at a site) can then be used as a measure of the evolutionary constraints imposed by function. However, if sequence availability is limited, the estimation of rates is subject to sampling error. This article presents a simulation study that evaluates the robustness of evolutionary site-rate estimates for both small and phylogenetically unbalanced samples. The sampling error on rate estimates was first evaluated for alignments of 5 to 45 sequences, sampled by jackknifing from a master alignment containing 968 sequences. We observed that the potentially enhanced resolution among site rates gained by including a larger number of rate categories is negated by the difficulty of correctly estimating intermediate rates. This effect is marked for data sets with fewer than 30 sequences. Although the computation of likelihood theoretically accounts for phylogenetic distances through branch lengths, the introduction of a single long-branch outlier sequence had a significant negative effect on site-rate estimates. Finally, a shift in rates of evolution between related lineages can be diagnostic of a gain or loss of function within a protein family. Our analyses indicate that detecting these rate shifts is a harder problem than estimating rates, partly because the difference in rates depends on two rate estimates, each with its own intrinsic uncertainty. The performance of four methods for detecting these site-rate shifts is evaluated and compared. Guidelines are suggested for preparing data sets that are minimally influenced by error introduced by sequence sampling.
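The site-rate estimate mentioned in the abstract (the rate of highest posterior probability given the data at a site) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the four example rates, and the per-category site likelihoods are hypothetical, and the categories are assumed to have equal prior probability, as is common for a discretized gamma distribution. In practice the likelihoods would come from a phylogenetic likelihood computation on a tree, which is not shown here.

```python
def posterior_over_rates(site_likelihoods, priors=None):
    """Posterior probability of each rate category at one site (Bayes' rule).

    site_likelihoods[i] is the likelihood of the site's data given rate
    category i. These values are assumed inputs; computing them requires
    a tree and a substitution model.
    """
    k = len(site_likelihoods)
    if priors is None:
        # Assumption: equal-probability categories.
        priors = [1.0 / k] * k
    joint = [p * lik for p, lik in zip(priors, site_likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


def map_rate(rates, site_likelihoods, priors=None):
    """Return (rate, posterior) for the category with the highest posterior."""
    post = posterior_over_rates(site_likelihoods, priors)
    best = max(range(len(rates)), key=lambda i: post[i])
    return rates[best], post[best]


def posterior_mean_rate(rates, site_likelihoods, priors=None):
    """Alternative point estimate: the posterior-weighted mean rate."""
    post = posterior_over_rates(site_likelihoods, priors)
    return sum(r * p for r, p in zip(rates, post))


# Hypothetical example: four relative rates with mean 1.0 and made-up
# per-category likelihoods for a single alignment column.
example_rates = [0.1, 0.5, 1.0, 2.4]
example_likelihoods = [1e-6, 4e-6, 3e-6, 2e-6]

best_rate, best_posterior = map_rate(example_rates, example_likelihoods)
mean_rate = posterior_mean_rate(example_rates, example_likelihoods)
```

With few sequences the per-category likelihoods tend to be similar, so the posterior is flat and the highest-posterior category is easily changed by sampling noise. This is consistent with the abstract's observation that intermediate rates are hard to estimate from small data sets.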