Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland.
Swiss Institute of Bioinformatics, Lausanne, Switzerland.
Bioinformatics. 2023 Jun 1;39(6). doi: 10.1093/bioinformatics/btad336.
Developing new crop varieties with superior performance is highly important to ensure robust and sustainable global food security. The speed of variety development is limited by long field cycles and advanced generation selections in plant breeding programs. While methods to predict yield from genotype or phenotype data have been proposed, improved performance and integrated models are needed.
We propose a machine learning model that leverages both genotype and phenotype measurements by fusing genetic variants with multiple data sources collected by unmanned aerial systems. We use a deep multiple instance learning framework with an attention mechanism that reveals the importance assigned to each input during prediction, which enhances interpretability. When predicting yield in similar environmental conditions, our model reaches a Pearson correlation coefficient of 0.754 ± 0.024, a 34.8% improvement over the genotype-only linear baseline (0.559 ± 0.050). We further predict yield on new lines in an unseen environment using only genotypes, obtaining a prediction accuracy of 0.386 ± 0.010, a 13.5% improvement over the linear baseline. Our multi-modal deep learning architecture efficiently accounts for plant health and environment, distilling the genetic contribution and improving prediction accuracy over the baseline. Yield prediction algorithms that leverage phenotypic observations during training therefore promise to improve breeding programs, ultimately speeding up the delivery of improved varieties.
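To make the attention-based multiple instance learning idea concrete, the following is a minimal sketch of one common attention-pooling formulation, in which each instance embedding h_i receives a score w·tanh(V h_i) and the scores are normalised with a softmax. It is an illustration under stated assumptions, not the authors' implementation: the function name `attention_mil_pool`, the parameters `V` and `w`, and the toy numbers are all hypothetical, and the yield-regression head that would follow the pooled vector is omitted.

```python
import math


def attention_mil_pool(instances, V, w):
    """Pool a bag of instance embeddings into one vector using attention.

    instances: list of embeddings, each a list of d floats (e.g. one per
               genotype or image-derived input; an assumption for illustration)
    V:         projection matrix as a list of rows, shape (hidden, d)
    w:         scoring vector of length hidden
    Returns (pooled_vector, attention_weights).
    """
    # Unnormalised attention score for each instance: w . tanh(V h_i)
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wk * zk for wk, zk in zip(w, hidden)))

    # Numerically stable softmax over the instances in the bag
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Bag representation: attention-weighted sum of instance embeddings
    dim = len(instances[0])
    pooled = [sum(a * h[j] for a, h in zip(weights, instances)) for j in range(dim)]
    return pooled, weights


# Toy bag of three 3-dimensional instance embeddings (hypothetical values)
bag = [[0.2, 0.1, 0.0], [0.9, 0.4, 0.3], [0.1, 0.8, 0.5]]
V = [[0.5, -0.2, 0.1], [0.3, 0.7, -0.4]]
w = [1.0, -0.5]

pooled, weights = attention_mil_pool(bag, V, w)
print("attention weights:", weights)
print("pooled embedding:", pooled)
```

The returned weights sum to one across the bag, so they can be read as the relative importance the model assigns to each input for a given prediction, which is the kind of interpretability described above.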
Available at https://github.com/BorgwardtLab/PheGeMIL (code) and https://doi.org/doi:10.5061/dryad.kprr4xh5p (data).