Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JT, UK.
Genetics. 2011 Aug;188(4):931-40. doi: 10.1534/genetics.111.128355. Epub 2011 May 19.
Sequencing errors and random sampling of nucleotide types among sequencing reads at heterozygous sites present challenges for accurate, unbiased inference of single-nucleotide polymorphism genotypes from high-throughput sequence data. Here, we develop a maximum-likelihood approach to estimate the frequency distribution of the number of alleles in a sample of individuals (the site frequency spectrum), using high-throughput sequence data. Our method assumes binomial sampling of nucleotide types in heterozygotes and random sequencing error. By simulations, we show that close to unbiased estimates of the site frequency spectrum can be obtained if the error rate per base read does not exceed the population nucleotide diversity. We also show that these estimates are reasonably robust if errors are nonrandom. We then apply the method to infer site frequency spectra for zerofold degenerate, fourfold degenerate, and intronic sites of protein-coding genes using the low coverage human sequence data produced by the 1000 Genomes Project phase-one pilot. By fitting a model to the inferred site frequency spectra that estimates parameters of the distribution of fitness effects of new mutations, we find evidence for significant natural selection operating on fourfold sites. We also find that a model with variable effects of mutations at synonymous sites fits the data significantly better than a model with equal mutational effects. Under the variable effects model, we infer that 11% of synonymous mutations are subject to strong purifying selection.
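As a rough illustration of the two modelling assumptions named above (binomial sampling of nucleotide types in heterozygotes and random sequencing error), the following minimal sketch computes per-genotype likelihoods for the reads at one site in one diploid individual. It is not the authors' implementation. The function names, the two-allele simplification (an error always turns one allele into the other), and the example values for the read counts and the error rate `e` are all assumptions made for illustration.

```python
from math import comb


def read_prob(g, e):
    """Probability that one read shows allele A, given g copies of A (0, 1 or 2).

    Assumption: only two alleles exist, and a sequencing error (rate e)
    always converts one allele into the other.
    """
    frac = g / 2  # chance the read was sampled from an A-bearing chromosome
    return frac * (1 - e) + (1 - frac) * e


def genotype_likelihoods(k, n, e):
    """Binomial likelihood of k A-reads out of n reads for g = 0, 1, 2."""
    likelihoods = []
    for g in (0, 1, 2):
        p = read_prob(g, e)
        likelihoods.append(comb(n, k) * p**k * (1 - p) ** (n - k))
    return likelihoods


# Hypothetical example: 4 reads at a site, 2 of them showing allele A,
# with an assumed error rate of 1% per base read.
liks = genotype_likelihoods(k=2, n=4, e=0.01)
for g, lik in zip((0, 1, 2), liks):
    print(f"copies of A = {g}: likelihood = {lik:.6g}")
```

In this simplified model a read from a heterozygote shows allele A with probability 0.5 whatever the error rate, so at low coverage a heterozygote can easily yield reads of only one type. A site frequency spectrum estimator along the lines described in the abstract would combine likelihoods of this kind across individuals and sites rather than calling each genotype separately.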