Torkamaneh Davoud, Belzile Francois
Département de Phytologie and Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC, Canada.
PLoS One. 2015 Jul 10;10(7):e0131533. doi: 10.1371/journal.pone.0131533. eCollection 2015.
Genotyping-by-sequencing (GBS) is a highly cost-effective, high-throughput genotyping approach. By its nature, however, GBS generates sizeable amounts of missing data, which must be imputed for many downstream analyses. The extent to which such missing data can be tolerated when calling SNPs has not been widely explored. In this work, we first explore the use of imputation to fill in missing genotypes in GBS datasets. Importantly, we use whole-genome resequencing data to assess the accuracy of the imputed data. Using a panel of 301 soybean accessions, we show that over 62,000 SNPs could be called when tolerating up to 80% missing data, a five-fold increase over the number called when tolerating up to 20% missing data. At all levels of missing data examined (between 20% and 80%), the resulting SNP datasets were of uniformly high accuracy (96-98%). We then used imputation to combine complementary SNP datasets derived from GBS and a SNP array (SoySNP50K). We thus produced an enhanced dataset of >100,000 SNPs, and the genotypes at the previously untyped loci were again imputed with a high level of accuracy (95%). Resequencing 23 accessions (among the 301 used in the GBS analysis) identified >4,000,000 SNPs, of which 1.4 million tag SNPs were used as a reference to impute this large set of SNPs on the entire panel of 301 accessions. These previously untyped loci could be imputed with around 90% accuracy. Finally, we used the 100K SNP dataset (GBS + SoySNP50K) to perform a GWAS on seed oil content within this collection of soybean accessions. Both the number of significant marker-trait associations and the peak significance levels improved considerably with this enhanced catalog of SNPs relative to the smaller catalog obtained from GBS alone at ≤20% missing data.
Our results demonstrate that imputation can be used to fill in both missing genotypes and untyped loci with very high accuracy and that this leads to more powerful genetic analyses.
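The two quantities the abstract relies on, the per-SNP missing-data threshold and the accuracy of imputed genotypes against resequencing data, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the genotype encoding (0/1/2 alternate-allele counts, `None` for a missing call), the function names, and the definition of accuracy as simple concordance at originally missing positions are all assumptions made for illustration, and the imputation step itself is not shown.

```python
from typing import Dict, List, Optional, Sequence

# Assumed encoding: 0/1/2 = alternate-allele count, None = missing call.
Genotype = Optional[int]


def missing_rate(calls: Sequence[Genotype]) -> float:
    """Fraction of samples with no genotype call at one SNP."""
    return sum(c is None for c in calls) / len(calls)


def filter_snps(matrix: Dict[str, List[Genotype]], max_missing: float) -> List[str]:
    """Return the SNP ids whose missing rate does not exceed max_missing."""
    return [snp for snp, calls in matrix.items()
            if missing_rate(calls) <= max_missing]


def imputation_accuracy(observed: Sequence[Genotype],
                        imputed: Sequence[Genotype],
                        truth: Sequence[Genotype]) -> float:
    """Concordance between imputed and reference genotypes.

    Only positions that were missing before imputation and that have a
    reference (e.g. resequencing) genotype are compared. This metric is an
    assumption of the sketch, not necessarily the one used in the paper.
    """
    compared = 0
    correct = 0
    for obs, imp, ref in zip(observed, imputed, truth):
        if obs is None and imp is not None and ref is not None:
            compared += 1
            correct += int(imp == ref)
    return correct / compared if compared else float("nan")


# Hypothetical toy data: three SNPs typed on five samples.
toy_gbs = {
    "snp1": [0, None, 2, 2, 0],
    "snp2": [None, None, None, None, 1],
    "snp3": [None, None, 1, 0, None],
}

strict = filter_snps(toy_gbs, max_missing=0.2)   # tolerate up to 20% missing
relaxed = filter_snps(toy_gbs, max_missing=0.8)  # tolerate up to 80% missing
print(len(strict), len(relaxed))
```

Relaxing `max_missing` retains more SNPs, which is the trade-off the abstract quantifies; whether the additional SNPs are trustworthy then depends on how well the imputed calls agree with an independent reference.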