Integration of machine learning and genome-wide association study to explore the genomic prediction accuracy of agronomic trait in oats (Avena sativa L.).

Author information

Peng Jinghan, Lei Xiong, Liu Tianqi, Xiong Yi, Wu Jiqiang, Xiong Yanli, You Minghong, Zhao Junming, Zhang Jian, Ma Xiao

Affiliations

College of Grassland Science and Technology, Sichuan Agricultural University, Chengdu, China.

Sichuan Academy of Grassland Science, Chengdu, China.

Publication information

Plant Genome. 2025 Mar;18(1):e20549. doi: 10.1002/tpg2.20549.

DOI:10.1002/tpg2.20549

PMID:39780036

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11711298/

Abstract

Machine learning (ML) has garnered significant attention for its potential to improve the accuracy of genomic prediction (GP) in various economic crops using complete genomic information. Genome-wide association studies (GWAS) are widely used to pinpoint trait-related causal variant loci in genomes. However, integrating the two methods for crop genomic prediction requires further research. In this study, we integrated ML and GWAS to assess the efficiency of GP for seven key agronomic traits in 195 oat (Avena sativa) cultivars from major oat-growing regions around the world. GWAS identified a total of 94 trait-associated single nucleotide polymorphisms. GP was conducted using the classical genomic best linear unbiased prediction (GBLUP) model and six ML models. GBLUP performed poorly for all traits except flag leaf width, and no single ML model consistently provided the best prediction accuracy across all traits. GWAS-derived markers gave better prediction accuracy than genome-wide markers: plant height reached its highest prediction accuracy with 100 GWAS-derived markers, whereas the remaining traits required more markers. By improving prediction accuracy while reducing the number of markers, these results help advance the use of GP in small oat breeding programs and confirm that high prediction accuracy can be achieved with smaller datasets.
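
The comparison between genome-wide and GWAS-derived markers can be illustrated with a minimal sketch. This is not the authors' pipeline: the genotypes and phenotypes are simulated, a simple single-marker correlation scan stands in for a real GWAS, the heritability used in GBLUP is fixed at an assumed 0.5 rather than estimated, and the six ML models are not shown. All function names (`vanraden_grm`, `gblup_predict`, `top_markers`, `cv_accuracy`) are hypothetical. Markers are ranked inside each training fold only, so that the held-out phenotypes do not influence marker selection.

```python
import numpy as np

def vanraden_grm(M):
    """Genomic relationship matrix from a 0/1/2 genotype matrix (individuals x markers)."""
    p = M.mean(axis=0) / 2.0                  # allele frequency per marker
    Z = M - 2.0 * p                           # centre each marker column
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

def gblup_predict(G, y, train, test, h2=0.5):
    """GBLUP prediction for test individuals; h2 is an assumed heritability."""
    lam = (1.0 - h2) / h2                     # ratio of residual to genetic variance
    mu = y[train].mean()
    K = G[np.ix_(train, train)] + lam * np.eye(len(train))
    alpha = np.linalg.solve(K, y[train] - mu)
    return mu + G[np.ix_(test, train)] @ alpha

def top_markers(M, y, train, k):
    """Indices of the k markers most correlated with the phenotype in the training set."""
    X = M[train] - M[train].mean(axis=0)
    yc = y[train] - y[train].mean()
    sd = X.std(axis=0)
    sd[sd == 0] = np.inf                      # monomorphic markers score zero
    score = np.abs(X.T @ yc) / sd             # proportional to |Pearson r|
    return np.argsort(score)[-k:]

def cv_accuracy(M, y, k_markers=None, folds=5, seed=0):
    """Mean Pearson correlation between predicted and observed values over CV folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    accs = []
    for test in np.array_split(idx, folds):
        train = np.setdiff1d(idx, test)
        if k_markers is None:
            cols = np.arange(M.shape[1])      # all genome-wide markers
        else:
            cols = top_markers(M, y, train, k_markers)
        G = vanraden_grm(M[:, cols])
        pred = gblup_predict(G, y, train, test)
        accs.append(np.corrcoef(pred, y[test])[0, 1])
    return float(np.mean(accs))

# Simulated data (assumption): 195 individuals, 2000 markers, 20 causal loci.
rng = np.random.default_rng(42)
n, m = 195, 2000
freqs = rng.uniform(0.1, 0.9, size=m)
M = rng.binomial(2, freqs, size=(n, m)).astype(float)
effects = np.zeros(m)
effects[rng.choice(m, 20, replace=False)] = rng.normal(size=20)
g = M @ effects
y = g + rng.normal(scale=g.std(), size=n)     # roughly half the variance is genetic

print("all markers:      ", round(cv_accuracy(M, y), 3))
print("top 100 per fold: ", round(cv_accuracy(M, y, k_markers=100), 3))
```

Varying `k_markers` in this sketch mimics, in a simplified way, the question of how many association-selected markers a trait needs; results on simulated data say nothing about the accuracies reported in the study.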
