Department of Plant and Soil Sciences, University of Delaware, Newark, Delaware 19716.
Program of Computational Genomics and Medicine, North Shore University Health System, Evanston, Illinois 60201.
G3 (Bethesda). 2017 Jul 5;7(7):2161-2170. doi: 10.1534/g3.117.042036.
High-throughput sequencing (HTS) of reduced representation genomic libraries has ushered in an era of genotyping-by-sequencing (GBS), where genome-wide genotype data can be obtained for nearly any species. However, there remains a need for imputation-free GBS methods for genotyping large samples taken from heterogeneous populations of heterozygous individuals. Meeting this need requires addressing several issues encountered with GBS: the sequencing of nonoverlapping sets of loci across multiple GBS libraries, a common missing data problem that results in low call rates for markers per individual, and applicability that tends to be limited to inbred line samples with sufficient linkage disequilibrium for accurate imputation. We addressed these issues while developing and validating a new, comprehensive platform for GBS. This study supports the notion that GBS can be tailored to particular aims, and our results indicate that large samples of unknown pedigree can be genotyped to obtain complete and accurate GBS data. By optimizing size selection to sequence a high proportion of shared loci among individuals in different libraries, and by applying simple filters, we established a GBS procedure that produces high call rates per marker (>85%) with accuracy exceeding 99.4%. Furthermore, by capitalizing on the sequence-read structure of GBS data (stacks of reads), we developed a new tool for resolving local haplotypes and scoring phased genotypes, a feature that is not available in many GBS pipelines. Using local haplotypes reduces the marker dimensionality of the genotype matrix while increasing the informativeness of the data. Phased GBS in maize also revealed the existence of reproducibly inaccurate (apparent accuracy) genotypes that were due to divergent copy number variants (CNVs) unobservable in the underlying single nucleotide polymorphism (SNP) data.
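To make the call rate per marker concrete, the following is a minimal Python sketch of how such a rate could be computed and used as a filter. It is not the procedure used in the study, whose filters are not specified in this abstract. The names `call_rate` and `filter_markers`, the genotype encoding, and the 0.85 threshold (borrowed from the reported >85% figure purely for illustration) are all assumptions.

```python
from typing import Dict, List, Optional

# Hypothetical encoding: each marker maps to one genotype call per individual,
# with None standing for a missing call.
Genotype = Optional[str]


def call_rate(calls: List[Genotype]) -> float:
    """Fraction of individuals with a non-missing genotype at one marker."""
    if not calls:
        return 0.0
    return sum(c is not None for c in calls) / len(calls)


def filter_markers(
    matrix: Dict[str, List[Genotype]], min_call_rate: float = 0.85
) -> Dict[str, List[Genotype]]:
    """Keep only markers whose call rate exceeds the (assumed) threshold."""
    return {
        marker: calls
        for marker, calls in matrix.items()
        if call_rate(calls) > min_call_rate
    }


# Toy data, invented for illustration only.
toy_matrix = {
    "m1": ["A/A", "A/T", None, "T/T"],      # 3 of 4 individuals called
    "m2": ["G/G"] * 9 + [None],             # 9 of 10 individuals called
}
kept = filter_markers(toy_matrix)
```

In this toy input, marker `m1` falls below the assumed threshold and is dropped, while `m2` is retained.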