Using machine learning to combine genetic and environmental data for maize grain yield predictions across multi-environment trials.

Author information

Department of Crop, Soil, and Environmental Sciences, Center for Agricultural Data Analytics, University of Arkansas, Fayetteville, AR, USA.

Department of Crop, Soil, and Environmental Sciences, University of Arkansas, Fayetteville, AR, USA.

Publication information

Theor Appl Genet. 2024 Jul 23;137(8):189. doi: 10.1007/s00122-024-04687-w.

DOI:10.1007/s00122-024-04687-w

PMID:39044035

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11266441/

Abstract

Incorporating feature-engineered environmental data into machine learning-based genomic prediction models is an efficient approach to indirectly model genotype-by-environment interactions. Complementing phenotypic traits and molecular markers with high-dimensional data such as climate and soil information is becoming a common practice in breeding programs. This study explored new ways to combine non-genetic information in genomic prediction models using machine learning. Using the multi-environment trial data from the Genomes To Fields initiative, different models to predict maize grain yield were fitted using various inputs: genetic, environmental, or a combination of both, either in an additive (genetic-and-environmental; G+E) or a multiplicative (genotype-by-environment interaction; GEI) manner. When including environmental data, the mean prediction accuracy of machine learning genomic prediction models increased by up to 7% over the well-established Factor Analytic Multiplicative Mixed Model across the three cross-validation scenarios evaluated. Moreover, using the G+E model was more advantageous than the GEI model given its superior, or at least comparable, prediction accuracy, its lower memory usage and computation time, and the flexibility of accounting for interactions by construction. Our results illustrate the flexibility provided by the machine learning (ML) framework, particularly with feature engineering. We show that the feature engineering stage offers a viable option for envirotyping and generates valuable information for machine learning-based genomic prediction models. Furthermore, we verified that genotype-by-environment interactions can be captured using tree-based approaches without explicitly including interactions in the model. These findings support the growing interest in merging high-dimensional genotypic and environmental data into predictive modeling.
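The difference between the additive (G+E) and multiplicative (GEI) inputs can be illustrated with a small sketch. The abstract does not say how the authors built their design matrices, so everything below is an assumption for illustration only: the toy arrays, the function names `build_g_plus_e` and `build_gei`, and the choice of per-observation pairwise products as the multiplicative form are hypothetical. The sketch only shows why an additive input is cheaper in memory than a multiplicative one (p + q columns versus p × q).

```python
import numpy as np

def build_g_plus_e(G, E, geno_idx, env_idx):
    """Additive input (G+E): concatenate each observation's marker row
    with the covariate row of the environment it was grown in."""
    return np.hstack([G[geno_idx], E[env_idx]])

def build_gei(G, E, geno_idx, env_idx):
    """Multiplicative input (GEI), illustrative form: every pairwise product
    of a marker feature and an environmental feature, per observation."""
    g = G[geno_idx]  # shape (n_obs, p)
    e = E[env_idx]   # shape (n_obs, q)
    # Outer product per observation, flattened to p * q columns.
    return np.einsum("ip,iq->ipq", g, e).reshape(len(g), -1)

# Toy data (hypothetical): 4 hybrids x 5 markers, 3 environments x 2 covariates.
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(4, 5)).astype(float)  # marker dosages 0/1/2
E = rng.normal(size=(3, 2))                         # engineered env features

# Every hybrid observed in every environment: 12 observations.
geno_idx = np.repeat(np.arange(4), 3)
env_idx = np.tile(np.arange(3), 4)

X_add = build_g_plus_e(G, E, geno_idx, env_idx)  # 5 + 2 = 7 columns
X_mul = build_gei(G, E, geno_idx, env_idx)       # 5 * 2 = 10 columns
print(X_add.shape, X_mul.shape)
```

With realistic marker and covariate counts the gap between p + q and p × q columns grows quickly, which is consistent with the lower memory and time reported for the G+E input. A tree-based learner fed `X_add` can still split on a marker and then on an environmental covariate, which is one way interactions can arise without explicit product terms.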
