Miao Na, Yang Mengke, Han Pingping, Qiao Jiakun, Che Zhaoxuan, Xu Fangjun, Dai Xiangyu, Zhu Mengjin
Key Lab of Agricultural Animal Genetics, Breeding, and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China.
Bioinform Adv. 2025 Feb 22;5(1):vbaf002. doi: 10.1093/bioadv/vbaf002. eCollection 2025.
Ensemble learning is a powerful machine learning approach that improves overall prediction performance by combining the predictions of multiple base models. Blending, a popular ensemble method, trains multiple base models and then uses their predictions as input to train a meta-model, which produces the final predictions. However, conventional blending partitions the training set by simple random sampling, which introduces bias and large variance and thereby reduces the stability and accuracy of prediction. In this study, we propose stratified sampling blending (ssBlending), a new algorithm that addresses the instability of conventional blending caused by random partitioning of the training set and further improves prediction accuracy.
We tested the prediction performance of ssBlending on multiple genotype datasets from different species: an animal (pig), a plant (loblolly pine), and a microorganism (yeast). The cross-species, multi-dataset validation shows that ssBlending is superior to conventional blending in both prediction accuracy and stability. In addition, we optimized the training set sampling rate (BestH) to facilitate practical application of the ssBlending algorithm. In summary, this study proposes a new algorithm that combines a stratification strategy with conventional blending, providing more options for ensemble learning in various fields.
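As a rough illustration of the partitioning step described above, the sketch below splits sample indices into a base-model training part and a holdout part so that every phenotype stratum is represented in both. It is not the authors' implementation: the abstract does not say how strata are defined, so equal-sized bins of the sorted phenotype values are assumed here, and the names `stratified_blend_split`, `holdout_rate`, and `n_strata` are hypothetical. How `holdout_rate` relates to the tuned sampling rate (BestH) is likewise not specified in the abstract.

```python
import random


def stratified_blend_split(y, holdout_rate=0.2, n_strata=5, seed=0):
    """Split sample indices into (train_idx, holdout_idx), stratified by y.

    Assumption: strata are equal-sized bins of the sorted phenotype values.
    The abstract does not state how ssBlending defines its strata.
    """
    if not 0.0 < holdout_rate < 1.0:
        raise ValueError("holdout_rate must be strictly between 0 and 1")
    if n_strata < 1:
        raise ValueError("n_strata must be at least 1")

    rng = random.Random(seed)
    # Rank samples by phenotype so each stratum covers a contiguous value range.
    order = sorted(range(len(y)), key=lambda i: y[i])
    n = len(order)

    train_idx, holdout_idx = [], []
    for s in range(n_strata):
        # Slice out the s-th stratum of the ranked samples.
        stratum = order[s * n // n_strata:(s + 1) * n // n_strata]
        # Randomize within the stratum, then take the same share from each one.
        rng.shuffle(stratum)
        k = round(len(stratum) * holdout_rate)
        holdout_idx.extend(stratum[:k])
        train_idx.extend(stratum[k:])

    return sorted(train_idx), sorted(holdout_idx)
```

In a blending workflow, the base models would be fitted on the samples in `train_idx`, their predictions for the samples in `holdout_idx` would form the input features of the meta-model, and the meta-model would be fitted against the holdout phenotypes. Replacing the per-stratum draw with a single shuffle of all indices gives the simple random partition used by conventional blending.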