Wu Chengxiu, Luo Jingyun, Xiao Yingjie
National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070 China.
Hubei Hongshan Laboratory, Wuhan, 430070 China.
Mol Breed. 2024 Feb 8;44(2):14. doi: 10.1007/s11032-024-01454-z. eCollection 2024 Feb.
Advances in high-throughput technologies in recent years have produced large, multi-dimensional plant omics datasets, and big-data-driven yield prediction has received increasing attention. Machine learning offers promising computational and analytical tools for interpreting the biological meaning of large crop datasets. In this study, we used multi-omics datasets from 156 maize recombinant inbred lines, comprising 2496 single nucleotide polymorphisms (SNPs), 46 image traits (i-traits) from 16 developmental stages obtained through an automatic phenotyping platform, and 133 primary metabolites. In benchmark tests across different types of prediction models, several machine learning methods, such as Partial Least Squares (PLS), Random Forest (RF), and Gaussian process with radial basis function kernel (GaussprRadial), predicted maize yield better than the other models, although the best-performing method differed slightly among i-trait, genomic, and metabolic data. We found that the better yield prediction may stem from differences in how the methods rank and filter data features, and that this ranking and filtering is linked to biological processes such as photosynthesis-related or kernel development-related regulation. Finally, integrating multiple omics data with the RF approach further improved the prediction accuracy for grain yield from 0.32 to 0.43. Our research provides new ideas for applying plant omics data and artificial intelligence approaches to facilitate crop genetic improvement.
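The comparison of single-omics and integrated prediction described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that prediction accuracy is measured as the Pearson correlation between observed and cross-validated predicted yield, that the omics layers are integrated by simple column-wise concatenation, and that scikit-learn's `RandomForestRegressor` stands in for the RF method; none of these details is stated in the abstract. The data are synthetic, and the names `cv_accuracy` and `benchmark` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict


def cv_accuracy(X, y, n_splits=5, seed=0):
    """Pearson correlation between observed and cross-validated predicted values.

    Assumption: this is one common definition of prediction accuracy;
    the abstract does not say which definition was used.
    """
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    y_pred = cross_val_predict(model, X, y, cv=folds)
    return float(np.corrcoef(y, y_pred)[0, 1])


def benchmark(blocks, y):
    """Score each omics block alone, then all blocks concatenated column-wise."""
    scores = {name: cv_accuracy(X, y) for name, X in blocks.items()}
    scores["combined"] = cv_accuracy(np.hstack(list(blocks.values())), y)
    return scores


# Synthetic stand-in data: 156 lines as in the abstract, but the feature
# counts and values below are made up for illustration only.
rng = np.random.default_rng(0)
n_lines = 156
blocks = {
    "snp": rng.integers(0, 2, size=(n_lines, 50)).astype(float) * 2,  # coded 0/2
    "i_trait": rng.normal(size=(n_lines, 20)),
    "metabolite": rng.normal(size=(n_lines, 30)),
}
# Simulated yield that depends on one feature from each block plus noise.
y = (
    blocks["snp"][:, 0]
    + blocks["i_trait"][:, 0]
    + blocks["metabolite"][:, 0]
    + rng.normal(size=n_lines)
)

scores = benchmark(blocks, y)
for name, r in scores.items():
    print(f"{name}: r = {r:.2f}")
```

Because the simulated yield draws on all three blocks, the concatenated model has access to signal that each single block lacks. Whether integration helps on real data depends on how much complementary information the omics layers carry.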
The online version contains supplementary material available at 10.1007/s11032-024-01454-z.