Incorporating information of causal variants in genomic prediction using GBLUP or machine learning models in a simulated livestock population.

Author information

Yang Jifan, Calus Mario P L, Wientjes Yvonne C J, Meuwissen Theo H E, Duenk Pascal

Affiliations

Animal Breeding and Genomics, Wageningen University & Research, Wageningen, 6700 AH, The Netherlands.

Faculty of Life Sciences, Norwegian University of Life Sciences, Ås, 1432, Norway.

Publication information

J Anim Sci Biotechnol. 2025 Aug 19;16(1):118. doi: 10.1186/s40104-025-01250-5.

DOI:10.1186/s40104-025-01250-5

PMID:40826369

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12362903/

Abstract

BACKGROUND

Genomic prediction has revolutionized animal breeding, with genomic best linear unbiased prediction (GBLUP) being the most widely used prediction model. In theory, the accuracy of genomic prediction could be improved by incorporating information from quantitative trait loci (QTL). This strategy could be especially beneficial for machine learning models, which are able to distinguish informative from uninformative features. The objective of this study was to assess the benefit of incorporating QTL genotypes in GBLUP and machine learning models. We simulated a livestock population under selection in which the QTL and their effects were known. We used four genomic prediction models, namely GBLUP, (weighted) 2GBLUP, random forest (RF), and support vector regression (SVR), to predict breeding values of young animals, and considered scenarios that differed in the proportion of genetic variance explained by the included QTL.
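As an illustration of the baseline model named above, the following is a minimal sketch of GBLUP on toy data, not the authors' pipeline. It assumes a VanRaden-type genomic relationship matrix, a heritability that is fixed in advance rather than estimated, a sample mean as the only fixed effect, and accuracy measured as the correlation between predicted and true breeding values. All function names, data sizes, and parameter values are hypothetical.

```python
import numpy as np

def vanraden_g(genotypes):
    """Genomic relationship matrix from 0/1/2 genotype codes (VanRaden-type)."""
    M = np.asarray(genotypes, dtype=float)
    p = M.mean(axis=0) / 2.0                 # allele frequency per marker
    Z = M - 2.0 * p                          # centre each marker column
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

def gblup(G, y, train, h2):
    """Predict breeding values for all animals from phenotypes of `train`.

    h2 is an ASSUMED heritability; real analyses estimate variance components.
    """
    lam = (1.0 - h2) / h2                    # residual-to-genetic variance ratio
    mu = y[train].mean()                     # sample mean as a simple fixed effect
    G_tt = G[np.ix_(train, train)]
    alpha = np.linalg.solve(G_tt + lam * np.eye(len(train)), y[train] - mu)
    return G[:, train] @ alpha               # includes animals without phenotypes

# Toy data (hypothetical sizes): every marker has a small additive effect.
rng = np.random.default_rng(0)
n_animals, n_markers = 200, 500
freq = rng.uniform(0.1, 0.9, n_markers)
geno = rng.binomial(2, freq, size=(n_animals, n_markers))
tbv = geno @ rng.normal(size=n_markers)      # true breeding values
y = tbv + rng.normal(scale=tbv.std(), size=n_animals)  # heritability near 0.5

train = np.arange(150)                       # phenotyped reference animals
valid = np.arange(150, n_animals)            # "young" animals to predict
gebv = gblup(vanraden_g(geno), y, train, h2=0.5)
accuracy = np.corrcoef(gebv[valid], tbv[valid])[0, 1]
print(f"validation accuracy: {accuracy:.2f}")
```

The true breeding values are available only because the data are simulated, which is also what allows a simulation study of this kind to report accuracy directly.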

RESULTS

2GBLUP resulted in the highest accuracy. Its accuracy increased until the included QTL explained 80% of the genetic variance, after which it dropped. With the weighted 2GBLUP model, accuracy always increased when more QTL were included. The prediction accuracy of GBLUP was consistently higher than that of SVR, and the accuracy of both models increased slightly as more QTL information was included. The RF model resulted in the lowest prediction accuracy, and its accuracy did not improve when QTL information was included.
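For the two machine learning models, one simple way to include QTL information is to append the QTL genotypes as extra feature columns. The sketch below does this with scikit-learn on toy data. It is an assumption about how such a comparison could be set up, not a description of the authors' method, and the hyperparameters, data sizes, and names are arbitrary. The toy markers are simulated independently of the QTL (no linkage disequilibrium), so the marker-only models have little signal here and the printed values say nothing about the results reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(1)
n_animals, n_snp, n_qtl = 150, 200, 10       # hypothetical sizes

snp = rng.binomial(2, 0.5, size=(n_animals, n_snp)).astype(float)
qtl = rng.binomial(2, 0.5, size=(n_animals, n_qtl)).astype(float)
tbv = qtl @ rng.normal(size=n_qtl)           # true breeding values from QTL only
y = tbv + rng.normal(scale=tbv.std(), size=n_animals)

train = np.arange(100)                       # reference animals
valid = np.arange(100, n_animals)            # animals to predict

def accuracy(model, X):
    """Correlation between predictions and true breeding values (assumed metric)."""
    model.fit(X[train], y[train])
    return np.corrcoef(model.predict(X[valid]), tbv[valid])[0, 1]

feature_sets = {
    "markers only": snp,
    "markers + QTL": np.hstack([snp, qtl]),  # QTL genotypes as extra columns
}
models = {
    "RF": lambda: RandomForestRegressor(n_estimators=50, random_state=0),
    "SVR": lambda: SVR(kernel="rbf", C=1.0),
}

results = {}
for model_name, make_model in models.items():
    for set_name, X in feature_sets.items():
        results[(model_name, set_name)] = accuracy(make_model(), X)
        print(model_name, set_name, round(results[(model_name, set_name)], 2))
```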

CONCLUSIONS

Our results show that incorporating QTL information in GBLUP and SVR can improve prediction accuracy, but the extent of improvement varies across models. RF had a much lower prediction accuracy than the other models and did not improve when QTL information was added. Two possible reasons for this result are that the structure of our data does not allow RF to realize its full potential and that RF is not well designed for this particular prediction problem. Our study highlights the importance of selecting appropriate models for genomic prediction and underscores the potential limitations of machine learning models when applied to genomic prediction in livestock.
