Institut National de la Recherche Agronomique, Unité Mixte de Recherche CBGP, (Inra, Ird, Cirad, Montpellier-SupAgro) 34988 Montferrier-sur-Lez Cedex, France.
Genetics. 2014 Mar;196(3):799-817. doi: 10.1534/genetics.113.152991. Epub 2013 Dec 20.
The recent advent of high-throughput sequencing and genotyping technologies makes it possible to produce, easily and cost-effectively, large amounts of detailed data on the genotype composition of populations. Detecting locus-specific effects may help identify those genes that have been, or are currently, targeted by natural selection. How best to identify these selected regions, loci, or single nucleotides remains a challenging issue. Here, we introduce a new model-based method, called SelEstim, to distinguish putative selected polymorphisms from the background of neutral (or nearly neutral) ones and to estimate the intensity of selection at the former. The underlying population genetic model is a diffusion approximation for the distribution of allele frequency in a population subdivided into a number of demes that exchange migrants. We use a Markov chain Monte Carlo algorithm for sampling from the joint posterior distribution of the model parameters, in a hierarchical Bayesian framework. We present evidence from stochastic simulations that demonstrates the good power of SelEstim to identify loci targeted by selection and to estimate the strength of selection acting on these loci within each deme. We also reanalyze a subset of SNP data from the Stanford HGDP-CEPH Human Genome Diversity Cell Line Panel to illustrate the performance of SelEstim on real data. In agreement with previous studies, our analyses point to a very strong signal of positive selection upstream of the LCT gene, which encodes the enzyme lactase-phlorizin hydrolase and is associated with adult-type hypolactasia. The geographical distribution of the strength of positive selection across the Old World matches the interpolated map of lactase persistence phenotype frequencies, with the strongest selection coefficients in Europe and in the Indus Valley.
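To make the sampling idea in the abstract concrete, the following is a minimal sketch and not the SelEstim implementation. It assumes a single locus in a single deme, and it assumes that the stationary allele-frequency density has the Wright-type form exp(sigma*x) * x^(M*pi-1) * (1-x)^(M*(1-pi)-1). Here sigma is a scaled selection coefficient, M a scaled migration rate, and pi the migrant-pool allele frequency. The exponential prior on sigma, the fixed values of M and pi, and all function names are illustrative assumptions; the method described above instead samples all parameters jointly within a hierarchical model.

```python
import math
import random

GRID = 200
# Midpoints of a regular grid on (0, 1), used for numerical integration over x.
XS = [(i + 0.5) / GRID for i in range(GRID)]
LOG_X = [math.log(x) for x in XS]
LOG_1MX = [math.log(1.0 - x) for x in XS]


def logsumexp(vals):
    """Numerically stable log(sum(exp(v)))."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))


def log_likelihood(sigma, k, n, M, pi):
    """log P(k | sigma) up to a constant: binomial sampling of k copies out of n,
    integrated over the assumed density of the deme allele frequency x."""
    a, b = M * pi, M * (1.0 - pi)
    # Unnormalized log density of x on the grid (assumed Wright-type form).
    log_w = [sigma * x + (a - 1.0) * lx + (b - 1.0) * l1
             for x, lx, l1 in zip(XS, LOG_X, LOG_1MX)]
    # Same density multiplied by the binomial kernel x^k (1-x)^(n-k).
    log_num = [w + k * lx + (n - k) * l1
               for w, lx, l1 in zip(log_w, LOG_X, LOG_1MX)]
    return logsumexp(log_num) - logsumexp(log_w)


def metropolis_sigma(k, n, M, pi, prior_mean=10.0, step=2.0, iters=3000, seed=1):
    """Random-walk Metropolis sampler for sigma >= 0 under an assumed
    exponential prior with mean prior_mean."""
    rng = random.Random(seed)
    sigma = 1.0
    log_post = log_likelihood(sigma, k, n, M, pi) - sigma / prior_mean
    samples = []
    for _ in range(iters):
        # Reflecting the Gaussian proposal at zero keeps it symmetric.
        prop = abs(sigma + rng.gauss(0.0, step))
        lp = log_likelihood(prop, k, n, M, pi) - prop / prior_mean
        if rng.random() < math.exp(min(0.0, lp - log_post)):
            sigma, log_post = prop, lp
        samples.append(sigma)
    return samples


if __name__ == "__main__":
    draws = metropolis_sigma(k=38, n=40, M=10.0, pi=0.3)
    kept = draws[1000:]  # discard burn-in
    print("posterior mean of sigma:", sum(kept) / len(kept))
```

In this toy setting, a deme whose sample frequency is far above the migrant-pool frequency pi should yield a larger posterior mean for sigma than a deme whose sample frequency is close to pi. The midpoint-rule integration is only reasonable when M*pi and M*(1-pi) are at least 1, so that the density has no singularity at the boundaries.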