Sergei L. Kosakovsky Pond, Simon D. W. Frost
Antiviral Research Center, University of California San Diego, San Diego, California, USA.
Mol Biol Evol. 2005 Feb;22(2):223-34. doi: 10.1093/molbev/msi009. Epub 2004 Oct 13.
Genetic sequence data typically exhibit variability in substitution rates across sites. In practice, there is often too little variation to fit a different rate for each site in the alignment, but the distribution of rates across sites may not be well modeled using simple parametric families. Mixtures of different distributions can capture more complex patterns of rate variation, but are often parameter-rich and difficult to fit. We present a simple hierarchical model in which a baseline rate distribution, such as a gamma distribution, is discretized into several categories, the quantiles of which are estimated using a discretized beta distribution. Although this approach adds only two extra parameters to a standard distribution, it can capture a wide range of rate distributions. Using simulated data, we demonstrate that a "beta-gamma" model can reproduce the moments of the rate distribution more accurately than the distribution used to simulate the data, even when the baseline rate distribution is misspecified. Using hepatitis C virus and mammalian mitochondrial sequences, we show that a beta-gamma model can fit as well as or better than a model with multiple discrete rate categories, and compares favorably with a model that fits a separate rate category to each site. We also demonstrate this discretization scheme in the context of codon models specifically aimed at identifying individual sites undergoing adaptive or purifying evolution.
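The discretization described in the abstract can be sketched as follows. This is one plausible reading, not the authors' implementation: the function name `beta_gamma_categories`, its parameters, the use of equiprobable beta bins as cut points, and the choice of the conditional mean as each category's rate are all assumptions made for illustration. The baseline is taken to be a gamma distribution with mean 1 and shape `alpha`; the two extra parameters are the beta shapes `p` and `q`.

```python
import numpy as np
from scipy.stats import beta, gamma


def beta_gamma_categories(alpha, p, q, n_cat=4):
    """Hypothetical sketch of a beta-adjusted discrete gamma.

    Assumption: the boundaries of n_cat equiprobable bins of Beta(p, q)
    are used as cumulative-probability cut points for the baseline gamma.
    """
    # Cut points on [0, 1]; the first is 0 and the last is 1.
    cuts = beta.ppf(np.linspace(0.0, 1.0, n_cat + 1), p, q)

    # Probability mass the baseline gamma assigns to each category.
    weights = np.diff(cuts)

    # Rate boundaries: gamma quantiles at the cut points (mean-1 gamma).
    bounds = gamma.ppf(cuts, alpha, scale=1.0 / alpha)

    # For a mean-1 gamma, the integral of x * f(x) up to b equals the CDF
    # of a gamma with shape alpha + 1 and the same scale, evaluated at b.
    partial_means = gamma.cdf(bounds, alpha + 1.0, scale=1.0 / alpha)

    # Category rate = conditional mean of the gamma within each interval.
    rates = np.diff(partial_means) / weights
    return rates, weights


# Example with illustrative parameter values.
rates, weights = beta_gamma_categories(alpha=0.5, p=2.0, q=5.0, n_cat=4)
print(rates, weights)
```

Under these assumptions, setting `p = q = 1` makes the cut points uniform, so the scheme reduces to an ordinary equiprobable discrete gamma; other values of `p` and `q` move the cut points and yield categories with unequal weights, while the weighted mean rate stays at 1.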