Institute of Evolutionary Biology, CSIC-Universitat Pompeu Fabra, Barcelona, Spain.
Mol Ecol Resour. 2021 May;21(4):1216-1229. doi: 10.1111/1755-0998.13343. Epub 2021 Mar 5.
Population genomics is a fast-developing discipline with promising applications in a growing number of life sciences fields. Advances in sequencing technologies and bioinformatics tools allow population genomics to exploit genome-wide information to identify the molecular variants underlying traits of interest and the evolutionary forces that modulate these variants through space and time. However, the cost of genomic analyses of multiple populations is still too high to address them through individual genome sequencing. Pooling individuals for sequencing (Pool-seq) can be a more effective strategy for single nucleotide polymorphism (SNP) detection and allele frequency estimation because it yields a higher total coverage. Compared to individual sequencing, though, SNP calling from pools has the additional difficulty of distinguishing rare variants from sequencing errors, which is often avoided by establishing a minimum allele frequency threshold for the analysis. Finding an optimal balance between minimizing information loss and reducing sequencing costs is essential to ensure the success of population genomics studies. Here, we have benchmarked the performance of SNP callers for Pool-seq data, based on different approaches, under different conditions, and using both computer simulations and real data. We found that SNP caller performance varied for allele frequencies up to 0.35. We also found that SNP callers based on Bayesian (SNAPE-pooled) or maximum likelihood (MAPGD) approaches outperform the two heuristic callers tested (VarScan and PoolSNP) in terms of the balance between sensitivity and false discovery rate (FDR), in both simulated and real sequencing data. Our results will help inform the selection of the most appropriate SNP caller not only for large-scale population studies but also in cases where the Pool-seq strategy is the only option, such as in metagenomic or polyploid studies.
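The trade-off described above, between discarding rare variants and admitting sequencing errors, can be illustrated with a minimal Python sketch. It is not the method of VarScan, PoolSNP, or any other caller named in the abstract; it is a toy threshold rule applied to invented read counts. The function names (`call_snps_by_threshold`, `sensitivity_and_fdr`), the parameters `min_count` and `min_freq`, and all data values are assumptions made for illustration only.

```python
from typing import Dict, Set, Tuple

Site = Tuple[str, int]  # (chromosome, position); hypothetical site key


def call_snps_by_threshold(
    allele_counts: Dict[Site, Tuple[int, int]],
    min_count: int,
    min_freq: float,
) -> Set[Site]:
    """Toy heuristic caller: keep a site if the alternative allele is seen
    in at least `min_count` reads and at a frequency of at least `min_freq`.

    `allele_counts` maps each site to (alternative_reads, total_reads).
    """
    called = set()
    for site, (alt_reads, depth) in allele_counts.items():
        if depth == 0:
            continue  # no coverage, nothing to call
        if alt_reads >= min_count and alt_reads / depth >= min_freq:
            called.add(site)
    return called


def sensitivity_and_fdr(called: Set[Site], truth: Set[Site]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); FDR = FP / (TP + FP)."""
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, fdr


# Invented pooled read counts at five sites, each covered by 100 reads.
pileup = {
    ("chr1", 100): (30, 100),  # common true SNP
    ("chr1", 200): (1, 100),   # rare true SNP seen in a single read
    ("chr1", 300): (1, 100),   # sequencing error
    ("chr1", 400): (3, 100),   # rare true SNP
    ("chr1", 500): (2, 100),   # recurrent sequencing error
}
truth = {("chr1", 100), ("chr1", 200), ("chr1", 400)}

# Permissive thresholds keep every rare variant but also every error.
loose = sensitivity_and_fdr(call_snps_by_threshold(pileup, 1, 0.0), truth)

# Stricter thresholds remove the errors and lose one rare true SNP.
strict = sensitivity_and_fdr(call_snps_by_threshold(pileup, 3, 0.02), truth)
```

In this toy example the permissive setting recovers all three true SNPs at the cost of two false calls, while the strict setting makes no false calls but misses the singleton variant. This is the balance between sensitivity and FDR on which the callers are compared.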