Genomic prediction with machine learning in sugarcane, a complex highly polyploid clonally propagated crop with substantial non-additive variation for key traits.

Affiliations

Queensland Alliance for Agriculture and Food Innovation, University of Queensland, Queensland, Australia.

Sugar Research Australia, Mackay, Australia.

Publication information

Plant Genome. 2023 Dec;16(4):e20390. doi: 10.1002/tpg2.20390. Epub 2023 Sep 20.

DOI:10.1002/tpg2.20390

PMID:37728221

Abstract

Sugarcane has a complex, highly polyploid genome with multi-species ancestry. Additive models for genomic prediction of clonal performance might not capture interactions between genes and alleles from different ploidies and ancestral species. Genomic prediction in sugarcane is therefore an interesting test case for machine learning (ML) methods, which are purportedly able to handle high levels of complexity in prediction. Here, we investigated deep learning (DL) neural networks, including multilayer perceptrons (MLP) and convolutional neural networks (CNN), and an ensemble ML approach, random forest (RF), for genomic prediction in sugarcane. The data set comprised 2912 sugarcane clones, scored for 26,086 genome-wide single nucleotide polymorphism markers, with final assessment trial data for total cane harvested (TCH), commercial cane sugar (CCS), and fiber content (Fiber). The clones in the latest trial (2017) were used as the validation set. We compared the prediction accuracy of these methods with that of genomic best linear unbiased prediction (GBLUP) extended to include dominance and epistatic effects. Prediction accuracies from GBLUP models were up to 0.37 for TCH, 0.43 for CCS, and 0.48 for Fiber, while the optimized ML models reached 0.35 for TCH, 0.38 for CCS, and 0.48 for Fiber. Both the RF and DL neural network models had predictive ability comparable to that of the additive GBLUP model but were less accurate than the extended GBLUP model.
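To make the validation design concrete, the following is a minimal sketch of an additive GBLUP-style prediction with a held-out validation set, run on simulated data. It is not the authors' implementation. The abstract does not give the marker coding, the variance components, or the definition of prediction accuracy, so the sketch assumes a diploid 0/1/2 dosage coding (a simplification for a polyploid crop), a VanRaden-style relationship matrix, a fixed shrinkage parameter `lam`, and accuracy measured as the Pearson correlation between predicted and observed phenotypes. All function names are hypothetical.

```python
import numpy as np

def additive_relationship(M):
    """VanRaden-style additive relationship matrix from a clones x markers matrix.

    Assumes diploid 0/1/2 dosage coding, which is a simplification for sugarcane.
    """
    p = M.mean(axis=0) / 2.0                 # allele frequency per marker
    Z = M - 2.0 * p                          # centre each marker column
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

def gblup_predict(G, y_train, train_idx, valid_idx, lam=1.0):
    """Predict validation clones as mu + G_vt (G_tt + lam * I)^-1 (y_t - mu).

    lam plays the role of the residual-to-genetic variance ratio (assumed known here).
    """
    G_tt = G[np.ix_(train_idx, train_idx)]
    G_vt = G[np.ix_(valid_idx, train_idx)]
    mu = y_train.mean()
    alpha = np.linalg.solve(G_tt + lam * np.eye(len(train_idx)), y_train - mu)
    return mu + G_vt @ alpha

def prediction_accuracy(y_obs, y_pred):
    """Pearson correlation between observed and predicted values (assumed metric)."""
    return float(np.corrcoef(y_obs, y_pred)[0, 1])

# Simulated stand-in data: 400 clones, 200 markers, purely additive trait.
rng = np.random.default_rng(0)
n_clones, n_markers = 400, 200
M = rng.binomial(2, 0.3, size=(n_clones, n_markers)).astype(float)
effects = rng.normal(0.0, 1.0, n_markers)
g = M @ effects                              # true genetic values
y = g + rng.normal(0.0, g.std(), n_clones)   # heritability of roughly 0.5

# Mimic "latest trial as validation": the last 80 clones are held out.
train_idx = np.arange(0, 320)
valid_idx = np.arange(320, n_clones)

G = additive_relationship(M)
y_pred = gblup_predict(G, y[train_idx], train_idx, valid_idx, lam=1.0)
acc = prediction_accuracy(y[valid_idx], y_pred)
print(f"validation accuracy (Pearson r): {acc:.2f}")
```

An extended model of the kind the abstract describes would add further relationship matrices for dominance and epistasis alongside `G`; that step is omitted here because the abstract does not specify how those matrices were built.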
