Jenkins Paul A, Mueller Jonas W, Song Yun S
Department of Statistics, University of Warwick, Coventry CV4 7AL, United Kingdom.
Genetics. 2014 Jan;196(1):295-311. doi: 10.1534/genetics.113.158584. Epub 2013 Nov 8.
It is becoming routine to obtain data sets on DNA sequence variation across several thousands of chromosomes, providing unprecedented opportunity to infer the underlying biological and demographic forces. Such data make it vital to study summary statistics that offer enough compression to be tractable, while preserving a great deal of information. One well-studied summary is the site frequency spectrum: the empirical distribution, across segregating sites, of the sample frequency of the derived allele. However, most previous theoretical work has assumed that each site has experienced at most one mutation event in its genealogical history, which becomes less tenable for very large sample sizes. In this work we obtain, in closed form, the predicted frequency spectrum of a site that has experienced at most two mutation events, under very general assumptions about the distribution of branch lengths in the underlying coalescent tree. Among other applications, we obtain the frequency spectrum of a triallelic site in a model of historically varying population size. We demonstrate the utility of our formulas in two settings: First, we show that triallelic sites are more sensitive to the parameters of a population that has experienced historical growth, suggesting that they will have use if they can be incorporated into demographic inference. Second, we investigate a recently proposed alternative mechanism of mutation in which the two derived alleles of a triallelic site are created simultaneously within a single individual, and we develop a test to determine whether it is responsible for the excess of triallelic sites in the human genome.
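As an illustration of the summary statistic defined above, the following is a minimal sketch of how an empirical site frequency spectrum might be computed from a sample. It is not the method of the paper: it assumes the simpler biallelic setting (at most one mutation per site) that the paper generalizes, and the function name `site_frequency_spectrum` and the 0/1 input encoding (0 = ancestral, 1 = derived) are hypothetical choices made for this example.

```python
from collections import Counter


def site_frequency_spectrum(haplotypes):
    """Empirical site frequency spectrum of a biallelic sample.

    haplotypes: list of n equal-length rows, one per chromosome, with
    0 for the ancestral allele and 1 for the derived allele (an assumed
    encoding for this sketch).

    Returns a list of length n - 1 whose entry i - 1 is the fraction of
    segregating sites at which exactly i chromosomes carry the derived
    allele.
    """
    n = len(haplotypes)
    # Derived-allele count at each site (one column per site).
    derived_counts = [sum(column) for column in zip(*haplotypes)]
    # A site is segregating only if both alleles appear in the sample.
    segregating = [c for c in derived_counts if 0 < c < n]
    tally = Counter(segregating)
    total = len(segregating)
    if total == 0:
        return [0.0] * (n - 1)
    return [tally[i] / total for i in range(1, n)]


# Four chromosomes, four sites; the last site is fixed for the derived
# allele and therefore does not count as segregating.
sample = [
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]
sfs = site_frequency_spectrum(sample)
```

In this toy sample two of the three segregating sites are singletons and one has derived count three. A triallelic site, the case treated in the paper, would need a joint spectrum over the counts of two derived alleles, which this sketch does not attempt.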