Computer Science Division and Department of Statistics, University of California, Berkeley, CA 94720.
Proc Natl Acad Sci U S A. 2014 Feb 11;111(6):2385-90. doi: 10.1073/pnas.1322709111. Epub 2014 Jan 27.
Study sample sizes in human genetics are growing rapidly, and in due course it will become routine to analyze samples with hundreds of thousands, if not millions, of individuals. In addition to posing computational challenges, such large sample sizes call for carefully reexamining the theoretical foundation underlying commonly used analytical tools. Here, we study the accuracy of the coalescent, a central model for studying the ancestry of a sample of individuals. The coalescent arises as a limit of a large class of random mating models, and it is an accurate approximation to the original model provided that the population size is sufficiently larger than the sample size. We develop a method for performing exact computation in the discrete-time Wright-Fisher (DTWF) model and compare several key genealogical quantities of interest with the coalescent predictions. For recently inferred demographic scenarios, we find that there are a significant number of multiple- and simultaneous-merger events under the DTWF model, which are absent in the coalescent by construction. Furthermore, for large sample sizes, there are noticeable differences in the expected number of rare variants between the coalescent and the DTWF model. To balance the trade-off between accuracy and computational efficiency, we propose a hybrid algorithm that uses the DTWF model for the recent past and the coalescent for the more distant past. Our results demonstrate that the hybrid method with only a handful of generations of the DTWF model leads to a frequency spectrum that is quite close to the prediction of the full DTWF model.
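The merger events discussed above can be illustrated with a minimal sketch of a single backward generation of a Wright-Fisher model. This is not the exact computational method developed in the paper. It assumes a haploid population of constant size N in which each of n sampled lineages picks its parent uniformly at random, and the function names (dtwf_one_generation, classify, expected_distinct_parents) are hypothetical names chosen for this illustration.

```python
import random
from collections import Counter


def dtwf_one_generation(n, N, rng):
    """Let n lineages each pick a parent uniformly among N haploid individuals.

    Returns the number of sampled children per chosen parent, largest first.
    """
    parents = Counter(rng.randrange(N) for _ in range(n))
    return sorted(parents.values(), reverse=True)


def classify(family_sizes):
    """Label the kind of merger event that occurred in one generation."""
    merging = [s for s in family_sizes if s >= 2]
    if not merging:
        return "none"
    if len(merging) >= 2:
        # Two or more parents each received several lineages in the same
        # generation (this label also covers cases that include a triple merger).
        return "simultaneous"
    if merging[0] >= 3:
        return "multiple"      # three or more lineages share one parent
    return "pairwise"          # the only event type the coalescent allows


def expected_distinct_parents(n, N):
    """Expected number of distinct parents of n lineages under this model."""
    return N * (1.0 - (1.0 - 1.0 / N) ** n)


def merger_frequencies(n, N, generations, seed=0):
    """Estimate how often each event type occurs when n lineages step back once."""
    rng = random.Random(seed)
    counts = Counter(
        classify(dtwf_one_generation(n, N, rng)) for _ in range(generations)
    )
    return {kind: counts[kind] / generations
            for kind in ("none", "pairwise", "multiple", "simultaneous")}


if __name__ == "__main__":
    # Illustrative values only: a small sample and a large sample, same N.
    for n in (20, 2000):
        print(n, merger_frequencies(n, N=20000, generations=2000))
```

When n is small relative to N, almost every generation should show either no merger or a single pairwise merger, which is the regime in which the coalescent is a good approximation. As n grows toward N, the "multiple" and "simultaneous" labels become common in this sketch, which is the qualitative effect the abstract describes.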