INRA, UMR CBGP (INRA-IRD-Cirad-Montpellier SupAgro), Campus international de Baillarguet, CS 30016, F-34988, Montferrier-sur-Lez, France.
Mol Ecol. 2013 Jul;22(14):3766-79. doi: 10.1111/mec.12360. Epub 2013 Jun 4.
Molecular markers produced by next-generation sequencing (NGS) technologies are revolutionizing genetic research. However, the costs of analysing large numbers of individual genomes remain prohibitive for most population genetics studies. Here, we present results based on mathematical derivations showing that, under many realistic experimental designs, NGS of DNA pools from diploid individuals allows allele frequencies at single nucleotide polymorphisms (SNPs) to be estimated with at least the same accuracy as individual-based analyses, with considerably less library construction and sequencing effort. These findings remain true when taking into account the possibility of substantially unequal contributions of each individual to the final pool of sequence reads. We propose the intuitive notion of effective pool size to account for unequal pooling and derive a Bayesian hierarchical model to estimate this parameter directly from the data. We provide a user-friendly application that assesses the accuracy of allele frequency estimation from both pool- and individual-based NGS population data under various sampling, sequencing depth and experimental error designs. We illustrate our findings with theoretical examples and real data sets corresponding to SNP loci obtained using restriction site-associated DNA (RAD) sequencing in pool- and individual-based experiments carried out on the same population of the pine processionary moth (Thaumetopoea pityocampa). NGS of DNA pools might not be optimal for all types of studies, but it provides a cost-effective approach for estimating allele frequencies for very large numbers of SNPs. It thus allows comparison of genome-wide patterns of genetic variation for large numbers of individuals in multiple populations.
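To give a feel for how pool size and sequencing depth jointly limit the accuracy of a pooled allele frequency estimate, the following minimal sketch simulates a simplified two-stage sampling scheme. It is an illustration under stated assumptions, not the model derived in the paper: every individual is assumed to contribute equally to the pool, sequencing errors are ignored, and reads are drawn with replacement from the pooled gene copies. The function names (`pooled_variance`, `simulate_pooled_estimate`, `empirical_variance`) are hypothetical. Under these assumptions, the sampling variance of the estimate for n diploid individuals, read depth C and true frequency p is p(1 - p)[1/(2n) + (1 - 1/(2n))/C], which follows from the law of total variance applied to the two binomial stages.

```python
import random


def pooled_variance(p, n_individuals, depth):
    """Sampling variance of a pooled allele frequency estimate.

    Assumes two binomial stages (a simplification, not the paper's full model):
    2n gene copies drawn from the population, then `depth` reads drawn
    with replacement from the pool, with equal contributions and no errors.
    """
    k = 2 * n_individuals  # gene copies in a pool of diploid individuals
    return p * (1 - p) * (1 / k + (1 - 1 / k) / depth)


def simulate_pooled_estimate(p, n_individuals, depth, rng):
    """Simulate one pooled experiment and return the read-based estimate."""
    k = 2 * n_individuals
    # Stage 1: sample the gene copies that end up in the pool.
    copies = sum(rng.random() < p for _ in range(k))
    pool_freq = copies / k
    # Stage 2: sample reads from the pool at the pool's allele frequency.
    reads = sum(rng.random() < pool_freq for _ in range(depth))
    return reads / depth


def empirical_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


if __name__ == "__main__":
    rng = random.Random(42)
    p, n, depth = 0.3, 20, 50  # illustrative values only
    estimates = [simulate_pooled_estimate(p, n, depth, rng) for _ in range(20000)]
    print("simulated variance:", empirical_variance(estimates))
    print("analytic variance: ", pooled_variance(p, n, depth))
```

Under this simplified scheme, increasing the read depth alone cannot push the variance below p(1 - p)/(2n), the contribution of sampling individuals, so adding individuals to the pool and adding reads both matter. Unequal individual contributions and experimental error, which the paper does address, are deliberately left out of the sketch.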