Queensland Alliance for Agriculture and Food Innovation, Queensland Bioscience Precinct, 306 Carmody Rd., St. Lucia, Brisbane, Queensland, 4067, Australia.
Agriculture and Food, CSIRO, Queensland Bioscience Precinct, St. Lucia, Brisbane, Queensland, 4067, Australia.
BMC Genomics. 2021 Oct 29;22(1):773. doi: 10.1186/s12864-021-08116-w.
High-density SNP arrays are now available for a wide range of crop species. Despite the development of many tools for generating genetic maps, the genome position of many SNPs from these arrays is unknown. Here we propose a linkage disequilibrium (LD)-based algorithm to allocate unassigned SNPs to chromosome regions from sparse genetic maps. This algorithm was tested on sugarcane, wheat, and barley data sets. We calculated the algorithm's efficiency by masking SNPs with known locations, then assigning their position to the map with the algorithm, and finally comparing the assigned and true positions.
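The abstract does not spell out the algorithm's steps, so the following is only a minimal sketch of one plausible LD-based placement rule, not the authors' implementation. It assumes genotypes coded as allele dosages, uses squared Pearson correlation (r²) as the LD measure, and places an unassigned SNP at the r²-weighted mean position of its strongest-LD mapped neighbours. The names `r2_to_mapped` and `place_snp` and the parameters `k` and `min_r2` are hypothetical.

```python
import numpy as np

def r2_to_mapped(query, mapped):
    """r^2 between one SNP (n_samples,) and each mapped SNP (n_samples, n_mapped)."""
    q = query - query.mean()
    m = mapped - mapped.mean(axis=0)
    denom = np.sqrt((q @ q) * (m * m).sum(axis=0))
    # Monomorphic SNPs have zero variance; treat their LD as 0.
    r = np.divide(q @ m, denom, out=np.zeros(m.shape[1]), where=denom > 0)
    return r ** 2

def place_snp(query, mapped, chrom, pos, k=5, min_r2=0.3):
    """Return (chromosome, position) for an unassigned SNP, or None if LD is too weak.

    k and min_r2 are illustrative assumptions, not values from the paper.
    """
    r2 = r2_to_mapped(query, mapped)
    top = np.argsort(r2)[::-1][:k]       # k mapped SNPs in strongest LD
    top = top[r2[top] >= min_r2]         # drop weak links
    if top.size == 0:
        return None                      # SNP stays unplaced
    # Choose the chromosome with the largest summed r^2 among the top hits.
    chroms, inverse = np.unique(chrom[top], return_inverse=True)
    best = chroms[np.argmax(np.bincount(inverse, weights=r2[top]))]
    keep = top[chrom[top] == best]
    # Position is the r^2-weighted mean of the supporting mapped SNPs.
    return str(best), float(np.average(pos[keep], weights=r2[keep]))
```

Under these assumptions, a SNP with no mapped marker above `min_r2` stays unplaced, which would be consistent with the abstract reporting that only a proportion of masked SNPs were placed.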
In 20-fold cross-validation, the mean proportion of masked mapped SNPs that the algorithm placed on a chromosome was 89.53, 94.25, and 97.23% for sugarcane, wheat, and barley, respectively. Of the markers that were placed, 98.73, 96.45, and 98.53% were positioned on the correct chromosome. The mean correlations between known and newly estimated SNP positions were 0.97, 0.98, and 0.97 for sugarcane, wheat, and barley, respectively. The LD-based algorithm was used to assign 5920 of 21,251 unpositioned markers to the current Q208 sugarcane genetic map, producing the highest-density genetic map for this species to date.
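The three kinds of figures reported above (proportion placed, proportion on the correct chromosome, and correlation between known and estimated positions) could be computed for one cross-validation fold roughly as follows. This is a sketch under assumptions: the function name `evaluate_placements` is hypothetical, unplaced SNPs are represented as `None`, and the correlation is taken over SNPs placed on the correct chromosome, which the abstract does not specify.

```python
import numpy as np

def evaluate_placements(true_chrom, true_pos, assigned):
    """Summarise one fold of masked SNPs.

    assigned[i] is (chromosome, position) or None when the SNP was not placed.
    Returns (proportion placed, proportion on correct chromosome, correlation).
    """
    placed = [(c, p, a) for c, p, a in zip(true_chrom, true_pos, assigned)
              if a is not None]
    prop_placed = len(placed) / len(assigned)
    # Keep (known position, estimated position) pairs on the correct chromosome.
    correct = [(p, a[1]) for c, p, a in placed if a[0] == c]
    prop_correct = len(correct) / len(placed) if placed else float("nan")
    if len(correct) >= 2:
        known, estimated = map(np.array, zip(*correct))
        corr = float(np.corrcoef(known, estimated)[0, 1])
    else:
        corr = float("nan")              # correlation undefined for < 2 points
    return prop_placed, prop_correct, corr
```

Averaging these values over the 20 folds would give summary statistics of the kind quoted in the abstract.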
Our LD-based approach can be used to accurately assign unpositioned SNPs to existing genetic maps, improving genome-wide association studies and genomic prediction in crop species with fragmented and incomplete genome assemblies. This approach will facilitate genomic-assisted breeding for many orphan crops that lack genetic and genomic resources.