Anand Bhaskar, Y X Rachel Wang, Yun S Song
Simons Institute for the Theory of Computing, Berkeley, California 94720, USA; Computer Science Division, University of California, Berkeley, California 94720, USA;
Department of Statistics, University of California, Berkeley, California 94720, USA
Genome Res. 2015 Feb;25(2):268-79. doi: 10.1101/gr.178756.114. Epub 2015 Jan 6.
With the recent increase in study sample sizes in human genetics, there has been growing interest in inferring historical population demography from genomic variation data. Here, we present an efficient inference method that can scale up to very large samples, with tens or hundreds of thousands of individuals. Specifically, by utilizing analytic results on the expected frequency spectrum under the coalescent and by leveraging the technique of automatic differentiation, which allows us to compute gradients exactly, we develop a very efficient algorithm to infer piecewise-exponential models of the historical effective population size from the distribution of sample allele frequencies. Our method is orders of magnitude faster than previous demographic inference methods based on the frequency spectrum. In addition to inferring demography, our method can also accurately estimate locus-specific mutation rates. We perform extensive validation of our method on simulated data and show that it can accurately infer multiple recent epochs of rapid exponential growth, a signal that is difficult to pick up with small sample sizes. Lastly, we use our method to analyze data from recent sequencing studies, including a large-sample exome-sequencing data set of tens of thousands of individuals assayed at a few hundred genic regions.
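The role of automatic differentiation in the abstract can be illustrated with a deliberately small sketch. It is not the authors' implementation. It assumes a constant-size population, for which the expected number of sites with i derived alleles is theta / i, in place of the piecewise-exponential models the paper fits, and it assumes independent Poisson counts for the frequency spectrum, which the abstract does not state. The names Dual, dlog, log_likelihood and gradient, as well as the counts in sfs, are hypothetical and made up for illustration. Forward-mode dual numbers carry the derivative alongside the value, so the gradient is exact up to floating-point rounding and involves no finite-difference step.

```python
import math


class Dual:
    """Number val + der*eps with eps**2 == 0, used for forward-mode AD."""

    def __init__(self, val, der=0.0):
        self.val = float(val)
        self.der = float(der)

    @staticmethod
    def lift(x):
        # Treat plain numbers as constants (zero derivative).
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.val - other.val, self.der - other.der)

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule.
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__

    def __truediv__(self, other):
        other = Dual.lift(other)
        # Quotient rule.
        return Dual(self.val / other.val,
                    (self.der * other.val - self.val * other.der)
                    / other.val ** 2)


def dlog(x):
    """Natural log of a Dual, propagating the derivative."""
    return Dual(math.log(x.val), x.der / x.val)


def log_likelihood(theta, sfs):
    """Poisson log-likelihood of SFS counts, dropping terms constant in theta.

    sfs[i - 1] is the number of sites with i derived alleles, i = 1..n-1.
    Assumption: constant population size, so E[count_i] = theta / i.
    """
    total = Dual(0.0)
    for i, count in enumerate(sfs, start=1):
        expected = theta / i
        total = total + count * dlog(expected) - expected
    return total


def gradient(theta_value, sfs):
    """Exact derivative of the log-likelihood with respect to theta."""
    # Seeding der=1.0 makes the output's .der equal d(loglik)/d(theta).
    return log_likelihood(Dual(theta_value, 1.0), sfs).der


# Made-up counts for a sample of n = 10 sequences (hypothetical data).
sfs = [120, 55, 42, 30, 21, 22, 15, 13, 12]
harmonic = sum(1.0 / i for i in range(1, len(sfs) + 1))
theta_hat = sum(sfs) / harmonic  # closed-form maximizer in this toy model
print(gradient(100.0, sfs), gradient(theta_hat, sfs))
```

In this toy model the maximizer has a closed form (the number of segregating sites divided by a harmonic number), so the gradient at theta_hat should vanish, which gives a convenient check on the differentiation code. A model with several epochs of exponential growth has no such closed form, and that is the setting in which exact gradients fed to a numerical optimizer become useful.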