Reeves Patrick A, Richards Christopher M
United States Department of Agriculture, Agricultural Research Service, National Center for Genetic Resources Preservation, Fort Collins, Colorado, United States of America.
PLoS One. 2009;4(1):e4269. doi: 10.1371/journal.pone.0004269. Epub 2009 Jan 27.
BACKGROUND: Accurate inference of genetic discontinuities between populations is an essential component of intraspecific biodiversity and evolution studies, as well as associative genetics. The most widely used methods to infer population structure are model-based Bayesian MCMC procedures that minimize Hardy-Weinberg disequilibrium and linkage disequilibrium within subpopulations. These methods are useful, but they have large computational requirements and depend on modeling assumptions that may not be met in real data sets. Here we describe the development of a new approach, PCO-MC, which couples principal coordinate analysis to a clustering procedure for the inference of population structure from multilocus genotype data.
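As a rough illustration of the principal coordinate analysis stage named above, the following is a minimal sketch of classical (Gower) principal coordinate analysis using NumPy. It is not the authors' implementation. The function names `pcoa` and `genotype_distance`, the 0/1/2 allele-count coding, and the choice of distance measure are assumptions made for illustration only; the abstract does not specify them.

```python
import numpy as np


def genotype_distance(genotypes):
    """Hypothetical distance: mean absolute difference in allele counts.

    `genotypes` is an (individuals x loci) array coded 0/1/2 (an assumed
    coding). The result is scaled to lie between 0 and 1.
    """
    g = np.asarray(genotypes, dtype=float)
    return np.abs(g[:, None, :] - g[None, :, :]).mean(axis=2) / 2.0


def pcoa(dist):
    """Classical principal coordinate analysis of a square distance matrix.

    Returns the coordinates of each individual on every axis with a
    positive eigenvalue, together with those eigenvalues.
    """
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    # B is symmetric, so eigh applies; reorder eigenvalues to descending
    evals, evecs = np.linalg.eigh(b)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    # Non-Euclidean distances can give negative eigenvalues; drop those
    keep = evals > 1e-9 * evals[0]
    coords = evecs[:, keep] * np.sqrt(evals[keep])
    return coords, evals[keep]
```

Calling `pcoa(genotype_distance(g))` on a genotype matrix `g` would return one row of coordinates per individual. All retained axes are returned rather than only the first two or three, since the approach described here uses every axis.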
METHODOLOGY/PRINCIPAL FINDINGS: PCO-MC uses data from all principal coordinate axes simultaneously to calculate a multidimensional "density landscape", from which the number of subpopulations and the membership of each subpopulation are determined using a valley-seeking algorithm. Using extensive simulations, we show that this approach outperforms a Bayesian MCMC procedure when many loci (e.g., 100) are sampled, but that the Bayesian procedure is marginally superior with few loci (e.g., 10). When presented with sufficient data, PCO-MC accurately delineated subpopulations with population F(st) values as low as 0.03 (G'(st)>0.2), whereas the limit of resolution of the Bayesian approach was F(st) = 0.05 (G'(st)>0.35).
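To make the idea of a density landscape more concrete, here is a minimal pure-Python sketch of a generic mode-seeking clustering. It is a simplified stand-in, not the PCO-MC algorithm itself: each point is linked to its nearest neighbour of higher estimated density, so that chains of links climb to density peaks and never cross the low-density valleys between them. The function names and the `bandwidth` and `radius` tuning parameters are hypothetical and do not come from the abstract.

```python
import math


def density_landscape(points, bandwidth):
    """Unnormalised Gaussian kernel density estimate at every point."""
    dens = []
    for p in points:
        total = 0.0
        for q in points:
            total += math.exp(-math.dist(p, q) ** 2 / (2 * bandwidth ** 2))
        dens.append(total)
    return dens


def mode_clusters(points, bandwidth, radius):
    """Label each point by the density peak it climbs to.

    A point is linked to its nearest denser neighbour within `radius`.
    A point with no such neighbour is treated as a peak and seeds its
    own cluster.
    """
    dens = density_landscape(points, bandwidth)
    n = len(points)
    parent = list(range(n))
    for i in range(n):
        best = None
        for j in range(n):
            # Density ties are broken by index, so links cannot form a cycle
            if (dens[j], j) > (dens[i], i):
                d = math.dist(points[i], points[j])
                if d <= radius and (best is None or d < best[0]):
                    best = (d, j)
        if best is not None:
            parent[i] = best[1]
    labels = []
    for i in range(n):
        root = i
        while parent[root] != root:
            root = parent[root]
        labels.append(root)
    return labels
```

In this sketch the number of distinct labels plays the role of the inferred number of subpopulations, and each label gives a discrete assignment. The points may have any number of dimensions, so coordinates on all principal coordinate axes could be passed in together.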
CONCLUSIONS/SIGNIFICANCE: We draw a distinction between population structure inference for describing biodiversity and population structure inference for Type I error control in associative genetics. We suggest that discrete assignments, like those produced by PCO-MC, are appropriate for circumscribing units of biodiversity, whereas expression of population structure as a continuous variable is more useful for case-control correction in structured association studies.