Guo Feng, Dey Dipak K, Holsinger Kent E
Feng Guo is Assistant Professor of Statistics, Department of Statistics, Virginia Tech, Blacksburg, VA 24061 (email:
J Am Stat Assoc. 2009 Mar 1;104(485):142-154. doi: 10.1198/jasa.2009.0010.
The distribution of genetic variation among populations is conveniently measured by Wright's F(ST), which is a scaled variance taking on values in [0,1]. For certain types of genetic markers, and for single-nucleotide polymorphisms (SNPs) in particular, it is reasonable to presume that allelic differences at most loci are selectively neutral. For such loci, the distribution of genetic variation among populations is determined by the size of local populations, the pattern and rate of migration among those populations, and the rate of mutation. Because the demographic parameters (population sizes and migration rates) are common across all autosomal loci, locus-specific estimates of F(ST) will depart from a common distribution only for loci with unusually high or low rates of mutation or for loci that are closely associated with genomic regions having a relationship with fitness. Thus, loci that are statistical outliers showing significantly more among-population differentiation than others may mark genomic regions subject to diversifying selection among the sample populations. Similarly, statistical outliers showing significantly less differentiation among populations than others may mark genomic regions subject to stabilizing selection across the sample populations. We propose several Bayesian hierarchical models to estimate locus-specific effects on F(ST), and we apply these models to SNP data from the HapMap project. Because loci that are physically associated with one another are likely to show similar patterns of variation, we introduce conditional autoregressive (CAR) models to incorporate the local correlation among loci for high-resolution genomic data. We estimate the posterior distributions of model parameters using Markov chain Monte Carlo (MCMC) simulations.
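The description of F(ST) as a scaled variance can be made concrete with a minimal sketch. It assumes a single biallelic locus, known (not estimated) allele frequencies, and equal weighting of populations; the function name `fst_scaled_variance` is hypothetical. It illustrates the textbook descriptive definition only, not the Bayesian hierarchical estimators proposed here.

```python
def fst_scaled_variance(freqs):
    """Descriptive F_ST for one biallelic locus.

    freqs: allele frequencies of the same allele in each population.
    Assumption: populations are equally weighted and the frequencies
    are treated as known, so no sampling correction is applied.
    """
    freqs = [float(p) for p in freqs]
    if not freqs or any(p < 0.0 or p > 1.0 for p in freqs):
        raise ValueError("frequencies must be a non-empty list in [0, 1]")

    n = len(freqs)
    p_bar = sum(freqs) / n                               # mean allele frequency
    var_p = sum((p - p_bar) ** 2 for p in freqs) / n     # among-population variance

    denom = p_bar * (1.0 - p_bar)                        # maximum possible variance
    if denom == 0.0:
        return 0.0                                       # monomorphic locus: no variation to partition
    return var_p / denom


# Identical frequencies give no differentiation; fixed differences give the maximum.
no_differentiation = fst_scaled_variance([0.3, 0.3, 0.3])
complete_differentiation = fst_scaled_variance([0.0, 1.0])
```

Because the among-population variance of frequencies with mean p̄ can never exceed p̄(1 − p̄), the ratio stays within [0,1], which is the range stated above.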
Model comparison using several criteria, including the deviance information criterion (DIC) and the log pseudo-marginal likelihood (LPML), reveals that a model with locus- and population-specific effects is superior to other models for the data used in the analysis. To detect statistical outliers, we propose an approach that uses the Kullback-Leibler divergence to measure how far the posterior distribution of each locus-specific effect departs from the common F(ST). We calibrate this measure by comparing its values with the divergence between a biased and a fair coin. We conduct a simulation study to illustrate the performance of our approach for detecting loci subject to stabilizing/divergent selection, and we apply the proposed models to low- and high-resolution SNP data from the HapMap project. Model comparison using DIC and LPML reveals that CAR models are superior to alternative models for the high-resolution data. For both low- and high-resolution data, we identify statistical outliers that are associated with known genes.
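The coin calibration mentioned above can be sketched as follows. The abstract does not state which direction of the divergence is used, so this sketch assumes one common convention: a divergence value k is mapped to the heads probability q ≥ 0.5 of a biased coin such that the Kullback-Leibler divergence of the fair coin from the biased coin equals k. The function names are hypothetical.

```python
import math


def kl_bernoulli(p, q):
    """KL divergence KL(Bernoulli(p) || Bernoulli(q)) in nats.

    Assumes 0 < p < 1 and 0 < q < 1 so that every logarithm is finite.
    """
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def equivalent_coin_bias(k):
    """Heads probability q >= 0.5 with KL(Bernoulli(0.5) || Bernoulli(q)) = k.

    Assumed convention: the fair coin is the reference distribution.
    Since KL(Bernoulli(0.5) || Bernoulli(q)) = -0.5 * log(4 q (1 - q)),
    solving for q gives q = (1 + sqrt(1 - exp(-2k))) / 2.
    """
    if k < 0.0:
        raise ValueError("a KL divergence cannot be negative")
    return 0.5 * (1.0 + math.sqrt(1.0 - math.exp(-2.0 * k)))


# A divergence of zero corresponds to a fair coin; larger values map to more biased coins.
q_small = equivalent_coin_bias(0.01)
q_large = equivalent_coin_bias(0.5)
```

Under this convention, a locus whose divergence maps to a q close to 0.5 is hard to distinguish from the common F(ST), while a q close to 1 indicates a posterior that differs as much as a heavily biased coin differs from a fair one.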