Schilling Martin P, Wolf Paul G, Duffy Aaron M, Rai Hardeep S, Rowe Carol A, Richardson Bryce A, Mock Karen E
Department of Biology, Utah State University, Logan, Utah, United States of America; Ecology Center, Utah State University, Logan, Utah, United States of America.
Department of Biology, Utah State University, Logan, Utah, United States of America.
PLoS One. 2014 Apr 18;9(4):e95292. doi: 10.1371/journal.pone.0095292. eCollection 2014.
Continuing advances in nucleotide sequencing technology are inspiring a suite of genomic approaches in studies of natural populations. Researchers are faced with data management and analytical scales that are increasing by orders of magnitude. With such dramatic advances comes a need to understand biases and error rates, which can be propagated and magnified in large-scale data acquisition and processing. Here we assess genomic sampling biases and the effects of various population-level data filtering strategies in a genotyping-by-sequencing (GBS) protocol. We focus on data from two species of Populus, because this genus has a relatively small genome and is emerging as a target for population genomic studies. We estimate the proportions and patterns of genomic sampling by examining the Populus trichocarpa genome (Nisqually-1), and demonstrate a pronounced bias towards coding regions when using the methylation-sensitive ApeKI restriction enzyme in this species. Using population-level data from a closely related species (P. tremuloides), we also investigate various approaches for filtering GBS data to retain high-depth, informative SNPs that can be used for population genetic analyses. We find that a data filter that includes the designation of ambiguous alleles yields metrics of population structure and Hardy-Weinberg equilibrium that are most consistent with previous studies of the same populations based on other genetic markers. Analyses of the filtered data (27,910 SNPs) also resulted in patterns of heterozygosity and population structure similar to those of a previous study using microsatellites. Our application demonstrates that technically and analytically simple approaches can readily be developed for population genomics of natural populations.
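The genomic sampling estimate described above rests on locating ApeKI restriction sites in a reference sequence. The following is a minimal sketch of such an in silico digest, not the authors' pipeline. It assumes the standard ApeKI recognition site GCWGC (W = A or T) with the cut after the first G; the function names are hypothetical, and methylation sensitivity and fragment size selection are not modeled.

```python
import re

# ApeKI recognition site: GCWGC, where W is A or T.
# A zero-width lookahead is used so that overlapping sites are all reported.
APEKI_SITE = re.compile(r"(?=GC[AT]GC)")

# The enzyme cuts after the first G of the site (G^CWGC).
CUT_OFFSET = 1


def apeki_cut_positions(seq):
    """Return 0-based cut positions for every ApeKI site in seq.

    The site is palindromic, so scanning one strand is sufficient.
    """
    seq = seq.upper()
    return [m.start() + CUT_OFFSET for m in APEKI_SITE.finditer(seq)]


def fragment_lengths(seq):
    """Return the lengths of the fragments produced by a complete digest."""
    cuts = apeki_cut_positions(seq)
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    # Toy sequence for illustration only; not real Populus data.
    example = "AAGCAGCTTTTGCTGCAA"
    print(apeki_cut_positions(example))
    print(fragment_lengths(example))
```

In a real analysis, the cut positions would be intersected with gene annotations to estimate how strongly the sampled fragments favor coding regions.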