Melo Arthur T O, Bartaula Radhika, Hale Iago
College of Life Sciences and Agriculture, Department of Biological Sciences, University of New Hampshire, Durham, NH, USA.
College of Life Sciences and Agriculture, Genetics Graduate Program, University of New Hampshire, Durham, NH, USA.
BMC Bioinformatics. 2016 Jan 12;17:29. doi: 10.1186/s12859-016-0879-y.
With its simple library preparation and robust approach to genome reduction, genotyping-by-sequencing (GBS) is a flexible and cost-effective strategy for SNP discovery and genotyping, provided an appropriate reference genome is available. For resource-limited curation, research, and breeding programs of underutilized plant genetic resources, however, even low-depth references may not be within reach, despite declining sequencing costs. Such programs would find value in an open-source bioinformatics pipeline that can maximize GBS data usage and perform high-density SNP genotyping in the absence of a reference.
The GBS SNP-Calling Reference Optional Pipeline (GBS-SNP-CROP) developed and presented here adopts a clustering strategy to build a population-tailored "Mock Reference" from the same GBS data used for downstream SNP calling and genotyping. Designed for libraries of paired-end (PE) reads, GBS-SNP-CROP maximizes data usage by eliminating unnecessary data culling due to imposed read-length uniformity requirements. Using 150 bp PE reads from a GBS library of 48 accessions of tetraploid kiwiberry (Actinidia arguta), GBS-SNP-CROP yielded on average three times as many SNPs as TASSEL-GBS analyses (32 and 64 bp tag lengths) and over 18 times as many as TASSEL-UNEAK, with fewer genotyping errors in all cases, as evidenced by comparing the genotypic characterizations of biological replicates. Using the published reference genome of a related diploid species (A. chinensis), the reference-based version of GBS-SNP-CROP behaved similarly to TASSEL-GBS in terms of the number of SNPs called but had an improved read depth distribution and fewer genotyping errors. Our results also indicate that the sets of SNPs detected by the different pipelines above are largely orthogonal to one another; thus GBS-SNP-CROP may be used to augment the results of alternative analyses, whether or not a reference is available.
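The replicate-based error check mentioned above can be illustrated with a minimal sketch. It assumes each biological replicate has been reduced to a mapping from a SNP identifier to a genotype string, with missing calls stored as None. The function name, the data layout, and the toy genotypes are hypothetical and are not taken from GBS-SNP-CROP itself; the sketch only shows one plausible way to score disagreement between two replicates of the same accession.

```python
from typing import Dict, Optional, Tuple

# Hypothetical data layout (an assumption, not the pipeline's own format):
# a SNP is identified by (cluster or chromosome name, position), and a
# genotype is a string such as "A/G"; None marks a missing call.
SnpId = Tuple[str, int]
Genotype = Optional[str]


def replicate_discordance(rep1: Dict[SnpId, Genotype],
                          rep2: Dict[SnpId, Genotype]) -> Optional[float]:
    """Return the fraction of jointly called SNPs whose genotypes differ.

    Only SNPs with a non-missing call in both replicates are compared.
    Returns None when no SNP is called in both replicates.
    """
    shared = [snp for snp in rep1.keys() & rep2.keys()
              if rep1[snp] is not None and rep2[snp] is not None]
    if not shared:
        return None  # rate is undefined without jointly called SNPs
    mismatches = sum(rep1[snp] != rep2[snp] for snp in shared)
    return mismatches / len(shared)


# Toy example with made-up genotypes for two replicates of one accession.
rep_a = {("c1", 10): "A/A", ("c1", 25): "A/G", ("c2", 7): None, ("c3", 3): "T/T"}
rep_b = {("c1", 10): "A/A", ("c1", 25): "G/G", ("c2", 7): "C/C", ("c3", 3): "T/T"}

rate = replicate_discordance(rep_a, rep_b)
print(f"discordance among jointly called SNPs: {rate:.3f}")
```

In this toy case three SNPs are called in both replicates and one of them disagrees, so a lower value would indicate fewer apparent genotyping errors. Because true biological replicates should share genotypes, any disagreement is read here as error; how missing calls and polyploid dosage are treated is a design choice left open by this sketch.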
By achieving high-density SNP genotyping in populations for which no reference genome is available, GBS-SNP-CROP is worth consideration by curators, researchers, and breeders of under-researched plant genetic resources. In cases where a reference is available, especially if from a related species or when the target population is particularly diverse, GBS-SNP-CROP may complement other reference-based pipelines by extracting more information per sequencing dollar spent. The current version of GBS-SNP-CROP is available at https://github.com/halelab/GBS-SNP-CROP.git.