Heilmann Philipp Georg, Frisch Matthias, Abbadi Amine, Kox Tobias, Herzog Eva
Institute of Agronomy and Plant Breeding II, Justus Liebig University, Gießen, Germany.
NPZ Innovation GmbH, Holtsee, Germany.
Front Plant Sci. 2023 Jul 21;14:1178902. doi: 10.3389/fpls.2023.1178902. eCollection 2023.
Testcross factorials in newly established hybrid breeding programs are often highly unbalanced, incomplete, and characterized by a predominance of specific combining ability (SCA) over general combining ability (GCA). This results in a low efficiency of GCA-based selection. Machine learning algorithms might improve prediction of hybrid performance in such testcross factorials, as they have been successfully applied to find complex underlying patterns in sparse data. Our objective was to compare the prediction accuracy of machine learning algorithms to that of GCA-based prediction and genomic best linear unbiased prediction (GBLUP) in six unbalanced incomplete factorials from hybrid breeding programs of rapeseed, wheat, and corn. We investigated a range of machine learning algorithms with three different types of predictor variables: (a) information on the parentage of hybrids, (b) additionally, the performance of hybrids from crosses of the parental lines with other crossing partners, and (c) genotypic marker data. In two highly incomplete and unbalanced factorials from rapeseed, in which the SCA variance contributed considerably to the genetic variance, stacked ensembles of gradient boosting machines based on parentage information outperformed GCA prediction. Compared to GCA prediction, the stacked ensembles increased prediction accuracy from 0.39 to 0.45 and from 0.48 to 0.54. The prediction accuracy of stacked ensembles without marker data was comparable to that of GBLUP, which requires marker data. We conclude that hybrid prediction with stacked ensembles of gradient boosting machines based on parentage information is a promising approach that merits further investigation with other data sets in which SCA variance is high.
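As an illustration of the GCA-based baseline mentioned above, the following minimal sketch predicts an untested hybrid as an overall mean plus the GCA effects of its two parents, fitted by ordinary least squares on an incomplete factorial. It is an assumption-laden toy, not the authors' implementation: the function names `fit_gca` and `predict_gca`, the parent labels, and the yield values are all hypothetical, and the abstract does not state which statistical model was used for GCA prediction.

```python
import numpy as np

def fit_gca(females, males, y):
    """Least-squares fit of y = mu + gca_female + gca_male (no SCA term).

    Hypothetical helper for illustration; a breeding program would more
    likely use a mixed model rather than this fixed-effects fit.
    """
    f_idx = {f: i for i, f in enumerate(sorted(set(females)))}
    m_idx = {m: i for i, m in enumerate(sorted(set(males)))}
    n_f = len(f_idx)
    # Design matrix: intercept, one-hot female parent, one-hot male parent.
    X = np.zeros((len(y), 1 + n_f + len(m_idx)))
    X[:, 0] = 1.0
    for row, (f, m) in enumerate(zip(females, males)):
        X[row, 1 + f_idx[f]] = 1.0
        X[row, 1 + n_f + m_idx[m]] = 1.0
    # The design is rank-deficient; lstsq returns the minimum-norm solution,
    # which still gives unique predictions when the crossing design is connected.
    beta, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return beta, f_idx, m_idx

def predict_gca(model, female, male):
    """Predict hybrid performance from the fitted mean and parental GCA effects."""
    beta, f_idx, m_idx = model
    return beta[0] + beta[1 + f_idx[female]] + beta[1 + len(f_idx) + m_idx[male]]

# Toy incomplete factorial (made-up values): 4 of 6 possible crosses were tested.
females = ["A", "A", "B", "C"]
males = ["X", "Y", "X", "Y"]
yields = [13.0, 9.0, 11.0, 8.0]

model = fit_gca(females, males, yields)
pred_BY = predict_gca(model, "B", "Y")  # untested cross
pred_CX = predict_gca(model, "C", "X")  # untested cross
print(round(pred_BY, 3), round(pred_CX, 3))
```

Because this model has no SCA term, it cannot capture interactions specific to a particular cross, which is the limitation the abstract attributes to GCA-based selection when SCA variance is high.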