Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh, United Kingdom.
Mol Biol Evol. 2010 Jun;27(6):1327-37. doi: 10.1093/molbev/msq023. Epub 2010 Jan 27.
Analysis of within-species polymorphism data usually relies on population genetic models that assume two alleles at a locus (e.g., the infinite sites model). However, many problems of interest can be tackled more naturally by multiallele models. In this study, I construct a model that can accommodate an arbitrary number of alleles at a locus, mutational biases, and selective differences between each of the alleles. It is constructed by representing population dynamics by a Markov transition matrix and is based on the assumption that at most two variants exist at each polymorphic site. A likelihood-based method for inferring the selection and mutational parameters of the model is constructed and is shown to have high accuracy. I use this method to jointly infer preferred codons and mutational parameters in Drosophila melanogaster. Twenty-one codons are identified as preferred, 19 of which were found previously by methods that do not use polymorphism data. Interestingly, the selective difference between the fittest and the worst codons encoding the same amino acid is positively correlated with the number of synonymous codons for that amino acid, in agreement with previous analyses of interspecies data using phylogenetic models. The inferred mutation matrix is highly asymmetric, with C→T and G→A being the most common and constituting approximately 18% and approximately 19% of all mutation events, respectively. These results suggest that the new model provides a useful framework for analyzing polymorphism data sampled from multiallele systems.
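To make the phrase "representing population dynamics by a Markov transition matrix" concrete, the following is a minimal sketch of a standard haploid Wright-Fisher transition matrix for a single site with two segregating variants. It is not the model constructed in the paper, which handles an arbitrary number of alleles and is not specified in this abstract. The function name `wright_fisher_matrix` and the parameters `n`, `s`, `u`, and `v` are hypothetical and chosen only for illustration.

```python
from math import comb


def wright_fisher_matrix(n, s=0.0, u=0.0, v=0.0):
    """Return P, where P[i][j] is the probability of moving from i to j
    copies of variant A in one generation.

    Assumptions (illustrative, not taken from the paper):
      n : haploid population size
      s : selective advantage of variant A over variant a
      u : per-generation mutation probability A -> a
      v : per-generation mutation probability a -> A
    """
    matrix = []
    for i in range(n + 1):
        x = i / n
        # Frequency of A after selection (relative fitness 1 + s versus 1).
        x_sel = x * (1 + s) / (x * (1 + s) + (1 - x))
        # Frequency of A after mutation in both directions.
        p = x_sel * (1 - u) + (1 - x_sel) * v
        # Binomial sampling of the next generation.
        row = [comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(n + 1)]
        matrix.append(row)
    return matrix


def expected_next_count(matrix, i):
    """Mean number of A copies in the next generation, starting from i copies."""
    return sum(j * prob for j, prob in enumerate(matrix[i]))
```

In this sketch, unequal `u` and `v` stand in for mutational bias and `s` for a selective difference between the two variants. A likelihood method of the kind the abstract describes would compare observed polymorphism counts with the frequencies such a chain predicts, but the details of that step are not given here.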