Patterson Nick, Price Alkes L, Reich David
Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America.
PLoS Genet. 2006 Dec;2(12):e190. doi: 10.1371/journal.pgen.0020190.
Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation. We discuss an approach to studying population structure (principal components analysis) that was first applied to genetic data by Cavalli-Sforza and colleagues. We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests. The statistical theory we use also reveals a general "phase change" phenomenon in the ability to detect structure in genetic data: for a fixed but large dataset size, divergence between two populations (as measured, for example, by a statistic such as FST) below a threshold is essentially undetectable, whereas divergence slightly above the threshold is easy to detect. This means that we can predict the dataset size needed to detect structure.
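The "phase change" described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' implementation. It simulates two equal-sized populations with a Balding-Nichols model (an assumption; the abstract names no simulation model), normalizes each marker to roughly unit variance, and compares the largest eigenvalue of the individual-by-individual covariance matrix with the Marchenko-Pastur bulk edge. The threshold FST = 1/sqrt(nm), for n markers and m individuals, is not stated in the abstract; it is the form usually quoted from the full paper and should be checked against it. The bulk-edge comparison is only a crude reference and is not the formal significance test the abstract refers to. All function names are hypothetical.

```python
import numpy as np


def simulate_genotypes(m, n, fst, rng):
    """Simulate m individuals (two equal populations, m even) at n biallelic markers.

    Assumption: Balding-Nichols model, with population allele frequencies drawn
    from a Beta distribution centred on the ancestral frequency.
    """
    p_anc = rng.uniform(0.1, 0.9, size=n)
    if fst > 0:
        a = p_anc * (1 - fst) / fst
        b = (1 - p_anc) * (1 - fst) / fst
        p_pop = rng.beta(a, b, size=(2, n))  # one row of frequencies per population
    else:
        p_pop = np.tile(p_anc, (2, 1))       # no divergence: identical frequencies
    labels = np.repeat([0, 1], m // 2)
    return rng.binomial(2, p_pop[labels])    # genotype counts 0/1/2, shape (m, n)


def top_eigenvalue(genotypes):
    """Largest eigenvalue of the normalized m x m covariance matrix."""
    g = genotypes.astype(float)
    p_hat = g.mean(axis=0) / 2
    keep = (p_hat > 0) & (p_hat < 1)         # drop monomorphic markers
    g, p_hat = g[:, keep], p_hat[keep]
    # Centre each marker and scale to about unit variance (illustrative choice).
    z = (g - 2 * p_hat) / np.sqrt(2 * p_hat * (1 - p_hat))
    cov = z @ z.T / z.shape[1]
    return np.linalg.eigvalsh(cov)[-1]       # eigvalsh returns ascending order


def bulk_edge(m, n):
    """Marchenko-Pastur upper edge for unit-variance noise with m rows, n columns."""
    return (1 + np.sqrt(m / n)) ** 2


def fst_threshold(m, n):
    """Assumed detection threshold FST = 1/sqrt(n*m) for two equal populations."""
    return 1 / np.sqrt(m * n)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n = 100, 2000
    print("assumed FST threshold:", fst_threshold(m, n))
    print("bulk edge:", bulk_edge(m, n))
    for fst in (0.0, 0.001, 0.005, 0.02):
        top = top_eigenvalue(simulate_genotypes(m, n, fst, rng))
        print(f"FST={fst}: top eigenvalue = {top:.3f}")
```

Under these assumptions, values of FST well below the threshold should leave the top eigenvalue close to the bulk edge, while values above it should push the top eigenvalue clear of the bulk. Increasing either m or n lowers the threshold, which is the sense in which the dataset size needed to detect a given level of structure can be predicted.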