Bayes empirical Bayes inference of amino acid sites under positive selection.

Author information

Yang Ziheng, Wong Wendy S W, Nielsen Rasmus

Affiliation

Department of Biology, University College London, London, UK.

Publication information

Mol Biol Evol. 2005 Apr;22(4):1107-18. doi: 10.1093/molbev/msi097. Epub 2005 Feb 2.

DOI:10.1093/molbev/msi097

PMID:15689528

Abstract

Codon-based substitution models have been widely used to identify amino acid sites under positive selection in comparative analysis of protein-coding DNA sequences. The nonsynonymous-synonymous substitution rate ratio (dN/dS, denoted ω) is used as a measure of selective pressure at the protein level, with ω > 1 indicating positive selection. Statistical distributions are used to model the variation in ω among sites, allowing a subset of sites to have ω > 1 while the rest of the sequence may be under purifying selection with ω < 1. An empirical Bayes (EB) approach is then used to calculate posterior probabilities that a site comes from the site class with ω > 1. Current implementations, however, use the naive EB (NEB) approach and fail to account for sampling errors in maximum likelihood estimates of model parameters, such as the proportions and ω ratios for the site classes. In small data sets lacking information, this approach may lead to unreliable posterior probability calculations. In this paper, we develop a Bayes empirical Bayes (BEB) approach to the problem, which assigns a prior to the model parameters and integrates over their uncertainties. We compare the new and old methods on real and simulated data sets. The results suggest that in small data sets the new BEB method does not generate false positives as did the old NEB approach, while in large data sets it retains the good power of the NEB approach for inferring positively selected sites.
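
The contrast between NEB and BEB described in the abstract can be illustrated with a deliberately simplified sketch. It is not the paper's codon model or its implementation: the per-site binomial likelihood, the two-class mixture with a fixed ω = 0.2 for the conserved class, the parameter grids, the uniform prior, and all function names (`site_lik`, `class_posterior`, `data_lik`, `neb`, `beb`) are assumptions chosen only to show the two calculations side by side. NEB plugs in the single best-fitting parameter values; BEB averages the same site-class posterior over all parameter values, weighted by how well each explains the whole data set.

```python
import math
from itertools import product

# Toy stand-in for the phylogenetic likelihood (an assumption, not the paper's model):
# a site is a pair (nonsynonymous changes, synonymous changes), and the share of
# nonsynonymous changes is binomial with success probability omega / (1 + omega).
def site_lik(n_nonsyn, n_syn, omega):
    q = omega / (1.0 + omega)
    return math.comb(n_nonsyn + n_syn, n_nonsyn) * q**n_nonsyn * (1.0 - q)**n_syn

def class_posterior(site, p_pos, omega_pos, omega_neg=0.2):
    """P(site belongs to the omega > 1 class | site data, fixed parameters)."""
    pos = p_pos * site_lik(*site, omega_pos)
    neg = (1.0 - p_pos) * site_lik(*site, omega_neg)
    return pos / (pos + neg)

def data_lik(sites, p_pos, omega_pos, omega_neg=0.2):
    """Likelihood of all sites under the two-class mixture (sites independent)."""
    total = 1.0
    for s in sites:
        total *= (p_pos * site_lik(*s, omega_pos)
                  + (1.0 - p_pos) * site_lik(*s, omega_neg))
    return total

# Hypothetical discrete grids for the two free parameters.
P_GRID = [0.05 + 0.1 * i for i in range(10)]   # proportion of the omega > 1 class
W_GRID = [1.5 + i for i in range(10)]          # omega of the positive-selection class

def neb(sites):
    """Naive EB: fix parameters at their (grid) maximum likelihood estimates."""
    best = max(product(P_GRID, W_GRID), key=lambda g: data_lik(sites, *g))
    return [class_posterior(s, *best) for s in sites]

def beb(sites):
    """Bayes EB: average the class posterior over the parameter posterior.

    With a uniform prior on the grid, the posterior weight of each grid point
    is proportional to the likelihood of the full data set at that point.
    """
    grid = list(product(P_GRID, W_GRID))
    weights = [data_lik(sites, *g) for g in grid]
    z = sum(weights)
    return [sum(w * class_posterior(s, *g) for w, g in zip(weights, grid)) / z
            for s in sites]
```

Calling `neb(sites)` and `beb(sites)` on the same short list of `(nonsynonymous, synonymous)` pairs returns one posterior probability per site from each method. With few sites the likelihood surface is flat, so the BEB average draws on many parameter values rather than one point estimate, which is the source of uncertainty the abstract says NEB ignores.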
