Department of Biostatistics, University of California, Los Angeles.
Mol Biol Evol. 2014 Mar;31(3):723-35. doi: 10.1093/molbev/mst229. Epub 2013 Nov 27.
The site frequency spectrum (SFS) is of primary interest in population genetic studies because it compresses variation data into a simple summary from which many population genetic inferences can proceed. However, inferring the SFS from sequencing data is challenging: genotype calls from sequencing data are often inaccurate because of high error rates, and if this genotype uncertainty is not accounted for, it can seriously bias downstream analyses based on the inferred SFS. Here, we compare two approaches to estimating the SFS from sequencing data. The first infers individual genotypes from aligned sequencing reads and then estimates the SFS from the inferred genotypes (call-based approach). The second estimates the SFS directly from aligned sequencing reads by maximum likelihood (direct estimation approach). We find that the SFS estimated by the direct estimation approach is unbiased even at low coverage, whereas the SFS estimated by the call-based approach becomes biased as coverage decreases. The direction of the bias in the call-based approach depends on the pipeline used to infer genotypes. Estimating genotypes by pooling individuals in a sample (multisample calling) underestimates the number of rare variants, whereas estimating genotypes in each individual and merging them later (single-sample calling) overestimates the number of rare variants. We characterize the impact of these biases on downstream analyses, such as demographic parameter estimation and genome-wide selection scans. Our work highlights that, depending on the pipeline used to infer the SFS, one can reach different conclusions in population genetic inference from the same data set. Thus, careful attention to the analysis pipeline and SFS estimation procedures is vital for population genetic inference.
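To make the summary concrete, the following is a minimal sketch of the final counting step of a call-based approach: given genotypes that have already been called, it tallies how many sites carry each derived allele count. It is illustrative only and is not the authors' pipeline. The function name `call_based_sfs`, the coding of genotypes as derived allele counts (0, 1, or 2), and the assumption of diploid individuals with a known ancestral allele are all choices made for this example. Because it treats the called genotypes as exact, it ignores the genotype uncertainty that the abstract identifies as the source of bias.

```python
from collections import Counter


def call_based_sfs(genotypes):
    """Tally an unfolded SFS from already-called diploid genotypes.

    genotypes: list of sites; each site is a list with one entry per
    individual giving the number of derived alleles (0, 1, or 2).
    Returns a list of length 2n + 1 (n = number of individuals), where
    entry k is the number of sites with exactly k derived alleles.
    """
    if not genotypes:
        return []
    n_individuals = len(genotypes[0])
    n_chromosomes = 2 * n_individuals  # assumption: diploid samples

    counts = Counter()
    for site in genotypes:
        if len(site) != n_individuals:
            raise ValueError("every site needs one genotype per individual")
        if any(g not in (0, 1, 2) for g in site):
            raise ValueError("genotypes must be coded as 0, 1, or 2")
        # The derived allele count at a site is the sum over individuals.
        counts[sum(site)] += 1

    return [counts[k] for k in range(n_chromosomes + 1)]


# Toy data: 5 sites, 3 diploid individuals (made up for illustration).
example = [
    [0, 0, 1],  # one derived allele (singleton)
    [0, 1, 1],  # two derived alleles
    [0, 0, 0],  # monomorphic ancestral
    [2, 2, 2],  # fixed derived
    [1, 0, 0],  # one derived allele (singleton)
]
sfs = call_based_sfs(example)
print(sfs)
```

In this toy input, a miscalled heterozygote would move a site between the zero and singleton classes, which is the kind of rare-variant miscounting the abstract describes for call-based pipelines at low coverage.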