AdaptiveGS: an explainable genomic selection framework based on adaptive stacking ensemble machine learning.

Authors

Yang Zhen, Song Mei, Huang Xianggeng, Rao Quanrui, Zhang Shanghui, Zhang Zhongzheng, Wang Chenyang, Li Wenjia, Qin Ran, Zhao Chunhua, Wu Yongzhen, Sun Han, Liu Guangchen, Cui Fa

Affiliations

School of Mathematics and Statistics, Ludong University, Yantai, 264025, Shandong, China.

School of Information and Electrical Engineering, Ludong University, Yantai, 264025, Shandong, China.

Publication information

Theor Appl Genet. 2025 Aug 7;138(9):204. doi: 10.1007/s00122-025-04991-z.

DOI:10.1007/s00122-025-04991-z

PMID:40772967

Abstract

We developed an adaptive, unified stacking framework for genomic selection and designed a model interpretation strategy to identify candidate significant SNPs for target traits. Genomic selection (GS) is an important technique in modern molecular breeding. As a powerful machine learning (ML) approach to GS, stacking ensemble learning (SEL) combines multiple base models (base learners, BLs) and blends the strengths of different models to capture the complex relationships between phenotypes and genotypes. However, the key step of SEL, the selection of BLs, currently lacks an effective and unified framework. We developed adaptiveGS, which we present as the first adaptive and explainable data-driven BL selection strategy, to pre-screen the optimal BLs for a stacking GS framework and improve prediction accuracy. adaptiveGS is driven by the PR index, which combines the Pearson correlation coefficient (PCC) and the normalized root mean square error (NRMSE); by default, the top 3 of 7 candidate ML models are chosen as BLs according to the PR index, and these settings can be user-defined. We compared adaptiveGS with 13 other GS algorithms on a total of 21 traits (datasets) from 4 species. adaptiveGS outperformed the 13 models on most of the 21 traits, with an average prediction accuracy (PCC) of 0.703, an average improvement of 14.4%, demonstrating superior predictive accuracy and robustness. Furthermore, the SHapley Additive exPlanations (SHAP) technique was used to interpret adaptiveGS, identifying SNPs that contribute significantly to trait variation as well as potential interaction effects between SNPs. adaptiveGS provides a practical and unified solution for stacking GS users to improve prediction accuracy in breeding. The adaptiveGS package is accessible at https://github.com/yangzhen0117/adaptiveGS .
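
The base-learner pre-screening step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula for the PR index, so the ratio PCC / NRMSE used below is an assumption made purely for illustration, as is normalizing the RMSE by the observed range. The function names, the candidate names (`m1` to `m7`), the simulated data, and the least-squares meta-learner are all hypothetical and are not taken from the adaptiveGS package.

```python
import numpy as np


def pcc(y_true, y_pred):
    """Pearson correlation coefficient between observed and predicted values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


def nrmse(y_true, y_pred):
    """RMSE divided by the observed range (one common normalization; assumed here)."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return float(rmse / (np.max(y_true) - np.min(y_true)))


def pr_index(y_true, y_pred):
    """Hypothetical PR index: higher PCC and lower NRMSE give a higher score."""
    return pcc(y_true, y_pred) / nrmse(y_true, y_pred)


def select_base_learners(y_true, oof_preds, k=3):
    """Rank candidate models by the PR index and return the names of the top k."""
    scores = {name: pr_index(y_true, pred) for name, pred in oof_preds.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def fit_linear_meta(y_true, oof_preds, selected):
    """Fit a least-squares meta-learner (with intercept) on the selected predictions."""
    columns = [np.ones(len(y_true))] + [oof_preds[name] for name in selected]
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y_true, rcond=None)
    return coef


# Toy data: seven hypothetical candidates whose out-of-fold predictions
# are the simulated phenotype plus increasing amounts of noise.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
noise_sd = {"m1": 0.2, "m2": 0.4, "m3": 0.6, "m4": 0.8, "m5": 1.0, "m6": 1.5, "m7": 2.0}
oof = {name: y + rng.normal(scale=sd, size=y.size) for name, sd in noise_sd.items()}

selected = select_base_learners(y, oof, k=3)      # top 3 of 7 candidates
meta_coef = fit_linear_meta(y, oof, selected)     # intercept + one weight per BL
```

In practice the out-of-fold predictions would come from cross-validating real candidate models on genotype data, and `k` corresponds to the user-adjustable number of base learners mentioned in the abstract.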