Dép. de Phytologie and Institut de Biologie Intégrative et des Systèmes, Univ. Laval, Quebec City, QC, Canada, G1V 0A6.
CÉROM, Centre de recherche sur les grains Inc., 740 chemin Trudeau, Saint-Mathieu-de-Beloeil, QC, Canada, J3G 0E2.
Plant Genome. 2019 Nov;12(3):1-11. doi: 10.3835/plantgenome2018.08.0061.
A gene-centric approach for haplotype definition was developed and implemented in R. The tool allows for allelic characterization at given loci in germplasm collections. Allelic status at four maturity genes is predicted on the basis of marker genotyping data.
Assessing the allelic diversity within a germplasm collection and identifying individuals carrying favorable alleles is challenging. Advances in high-throughput technologies allow the genotyping of many individuals for thousands of markers, but bridging the gap between single nucleotide polymorphisms (SNPs) and relevant alleles remains difficult. We developed a systematic approach that defines haplotypes from large SNP catalogs, with the aim of identifying haplotypes that can be equated to alleles at given genes. Unlike haplotype visualization tools, our approach selects SNP markers that flank a gene and defines haplotypes that correspond to this gene's alleles. We tested this approach on four known soybean [Glycine max (L.) Merr.] maturity genes (E1, GmGia, GmPhyA3, and GmPhyA2) in a collection of 67 lines and two genotypic datasets [a SNP array and genotyping-by-sequencing (GBS)]. For E1, GmGia, and GmPhyA3, we identified SNP haplotypes such that the allele found at these genes could be accurately predicted from the haplotype in 97.3% of the cases. For these genes, of the 12 known alleles in the collection, 10 and 8 could be correctly predicted from the haplotypes found with the SNP array and GBS datasets, respectively, with success rates of 98% and 97% for all allele-line combinations. The approach proved equally successful for data derived from a SNP array and GBS. However, in the case of GmPhyA2, a lack of markers in the genomic region prevented the identification of alleles, regardless of the dataset. We demonstrate the feasibility and reproducibility of our approach and identify limits to its applicability.
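The gene-centric idea described above (pick SNP markers flanking a gene, concatenate their calls into a haplotype, and check how well each haplotype predicts the known allele) can be illustrated with a minimal Python sketch. This is not the authors' R tool: the function names, the fixed number of flanking markers `k`, the majority-vote mapping, and the toy positions, genotypes, and allele labels are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def flanking_markers(positions, gene_start, gene_end, k=2):
    """Indices of up to k nearest markers on each side of the gene (hypothetical rule)."""
    upstream = [i for i, p in enumerate(positions) if p < gene_start]
    downstream = [i for i, p in enumerate(positions) if p > gene_end]
    upstream.sort(key=lambda i: gene_start - positions[i])   # nearest first
    downstream.sort(key=lambda i: positions[i] - gene_end)   # nearest first
    return sorted(upstream[:k] + downstream[:k])


def build_haplotypes(genotypes, marker_idx):
    """Concatenate each line's calls at the selected markers into a haplotype string."""
    return {line: "".join(calls[i] for i in marker_idx)
            for line, calls in genotypes.items()}


def haplotype_to_allele(haps, known_alleles):
    """Assign each haplotype the allele most often observed with it (majority vote)."""
    votes = defaultdict(Counter)
    for line, hap in haps.items():
        if line in known_alleles:
            votes[hap][known_alleles[line]] += 1
    return {hap: counts.most_common(1)[0][0] for hap, counts in votes.items()}


def prediction_accuracy(haps, mapping, known_alleles):
    """Fraction of lines whose known allele matches the haplotype-based prediction.

    Note: this scores the same lines used to build the mapping (in-sample).
    """
    scored = [line for line in haps if line in known_alleles]
    correct = sum(mapping.get(haps[line]) == known_alleles[line] for line in scored)
    return correct / len(scored)


# Toy data (entirely made up): six SNP positions around a gene at 500..800.
positions = [100, 250, 400, 900, 1050, 1300]
genotypes = {
    "L1": "AAGCTT",
    "L2": "AAGCTT",
    "L3": "ATGATC",
    "L4": "ATGATC",
    "L5": "AAGCTC",
}
known = {"L1": "allele_A", "L2": "allele_A", "L3": "allele_B",
         "L4": "allele_B", "L5": "allele_B"}

idx = flanking_markers(positions, gene_start=500, gene_end=800, k=2)
haps = build_haplotypes(genotypes, idx)
mapping = haplotype_to_allele(haps, known)
accuracy = prediction_accuracy(haps, mapping, known)
print(idx, haps, mapping, accuracy)
```

In this toy case line L5 shares its flanking haplotype with L1 and L2 but carries a different allele, so it is mispredicted. The same thing can happen in real data when the flanking markers do not distinguish two alleles. When no markers lie close to the gene, as reported for GmPhyA2, a selection step like `flanking_markers` has little to work with.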