Department of Genetics, Evolution, and Environment, University College London, London, WC1E 6BT, United Kingdom.
Genetics. 2013 Sep;195(1):195-204. doi: 10.1534/genetics.113.152025. Epub 2013 Jun 21.
Several studies have reported a negative correlation between estimates of the nonsynonymous to synonymous rate ratio (ω = dN/dS) and the sequence distance d in pairwise comparisons of the same gene from different species. That is, more divergent sequences produce smaller estimates of ω. Explanations for this negative correlation have included segregating nonsynonymous polymorphisms in closely related species and nonlinear dynamics of the ratio of two random variables. Here we study the statistical properties of the maximum-likelihood estimates of ω and d in pairwise alignments and explore whether those properties alone can explain the negative correlation. We show that the ω estimate is positively biased for small d and that the bias decreases as d increases. We also show that the estimates of ω and d are negatively correlated when ω < 1 and positively correlated when ω > 1. However, the bias in estimates of ω and the correlation between estimates of ω and d are not enough to explain the much stronger correlation observed in real data sets. We then explore the behavior of the estimates when the model is misspecified and suggest that the observed correlation may be due to protein-level selection that causes very different amino acids to be favored in different domains of the protein. Widely used models fail to account for such among-site heterogeneity and therefore underestimate the nonsynonymous rate and ω, with the bias being much stronger for distant sequences. We point out that tests of positive selection based on the ω ratio are invariant to the parameterization of the model and are thus unaffected by bias in the ω estimates or by the correlation between estimates of ω and d.
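One of the earlier explanations mentioned in the abstract, the behavior of a ratio of two random variables, can be illustrated with a toy calculation. The sketch below is not the maximum-likelihood estimator or the codon model studied in the paper. It assumes, purely for illustration, that the number of synonymous substitutions S is Poisson with mean `lam` (a hypothetical parameter standing in for distance times the number of synonymous sites), that S is independent of the nonsynonymous count, and that ω is estimated naively as a ratio of per-site counts, conditional on S > 0. Under those assumptions the relative bias of the naive estimate is lam · E[1/S | S > 0] − 1, which does not depend on ω. The function name `relative_bias` is likewise an illustrative choice.

```python
import math


def relative_bias(lam: float) -> float:
    """Relative bias of a naive count-ratio estimate of omega (toy model).

    Assumption (not from the paper): S ~ Poisson(lam), independent of the
    nonsynonymous count, and the estimate is only formed when S > 0.
    Returns lam * E[1/S | S > 0] - 1.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")

    # Truncate the sum far in the upper tail of the Poisson distribution.
    max_count = int(lam + 20.0 * math.sqrt(lam) + 50.0)

    log_p = -lam          # log P(S = 0)
    sum_inv = 0.0         # accumulates sum over s >= 1 of P(S = s) / s
    for s in range(1, max_count + 1):
        log_p += math.log(lam) - math.log(s)   # log P(S = s)
        sum_inv += math.exp(log_p) / s

    p_positive = 1.0 - math.exp(-lam)          # P(S > 0)
    expected_inverse = sum_inv / p_positive    # E[1/S | S > 0]
    return lam * expected_inverse - 1.0


if __name__ == "__main__":
    # Larger lam corresponds to more expected synonymous substitutions,
    # that is, to more divergent sequences in this toy setting.
    for lam in (5.0, 50.0, 500.0):
        print(f"lam = {lam:6.1f}  relative bias = {relative_bias(lam):.4f}")
```

In this toy setting the relative bias is positive and shrinks as `lam` grows, which matches the direction of the small-d bias described above. When `lam` is very small, conditioning on S > 0 dominates the result and the toy calculation is no longer informative. It also says nothing about the model-misspecification effect that the abstract proposes as the main explanation.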