Willis Stuart, Micheletti Steven, Andrews Kimberly R, Narum Shawn
Hagerman Genetics Lab, Columbia River Inter-Tribal Fish Commission, Hagerman, Idaho, USA.
Department of Zoology, University of British Columbia, Vancouver, British Columbia, Canada.
Mol Ecol Resour. 2025 Jul;25(5):e13888. doi: 10.1111/1755-0998.13888. Epub 2023 Nov 3.
Whole-genome sequencing data allow variation to be surveyed across the genome, reducing the constraint of balancing genome sub-sampling with estimating recombination rates and linkage between sampled markers and target loci. As sequencing costs decrease, low-coverage whole-genome sequencing of pooled or indexed-individual samples is commonly used to identify loci associated with phenotypes or environmental axes in non-model organisms. There are, however, relatively few publicly available bioinformatic pipelines designed explicitly to analyse these types of data, and fewer still that process the raw sequencing data, provide useful metrics of quality control and then execute analyses. Here, we present an updated version of a bioinformatics pipeline called PoolParty2 that can effectively handle either pooled or indexed DNA samples and includes new features to improve computational efficiency. Using simulated data, we demonstrate the ability of our pipeline to recover segregating variants, estimate their allele frequencies accurately, and identify genomic regions harbouring loci under selection. Based on the simulated dataset, we benchmark the efficacy of our pipeline against another bioinformatic suite, angsd, and illustrate the compatibility and complementarity of these suites by using angsd to generate genotype likelihoods, from alignment files and variants provided by PoolParty2, as input for identifying linkage outlier regions. Finally, we apply our updated pipeline to an empirical dataset of low-coverage whole-genome data from population samples of Columbia River steelhead trout (Oncorhynchus mykiss); the results demonstrate the genomic impacts of decades of artificial selection in a prominent hatchery stock. Thus, we not only demonstrate the utility of PoolParty2 for genomic studies that combine sequencing data from multiple individuals, but also illustrate how it complements other bioinformatics resources such as angsd.
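As a minimal illustration of what estimating allele frequencies from pooled reads can involve, the Python sketch below divides the alternate-allele read count by the total depth at a site and skips sites below a depth threshold. It is not PoolParty2's implementation: the function names, the `min_depth` default of 10 and the read counts are hypothetical, chosen only for this example.

```python
from typing import Optional


def pooled_allele_frequency(alt_reads: int, ref_reads: int,
                            min_depth: int = 10) -> Optional[float]:
    """Estimate the alternate-allele frequency at one site in one pool.

    Hypothetical helper: returns None when the total read depth is below
    min_depth, because frequency estimates from few reads are unreliable.
    """
    depth = alt_reads + ref_reads
    if depth < min_depth:
        return None
    return alt_reads / depth


def frequency_difference(pool_a: tuple, pool_b: tuple,
                         min_depth: int = 10) -> Optional[float]:
    """Absolute allele-frequency difference between two pools at one site.

    Each pool is given as (alt_reads, ref_reads). Returns None if either
    pool fails the depth filter.
    """
    freq_a = pooled_allele_frequency(*pool_a, min_depth=min_depth)
    freq_b = pooled_allele_frequency(*pool_b, min_depth=min_depth)
    if freq_a is None or freq_b is None:
        return None
    return abs(freq_a - freq_b)


# Illustrative read counts (not real data): (alt_reads, ref_reads) per pool.
hatchery_pool = (5, 15)
wild_pool = (15, 5)
site_difference = frequency_difference(hatchery_pool, wild_pool)
```

A real analysis would also need to account for pool size, sequencing error and unequal contributions of individuals to the pool, all of which this sketch ignores.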