Using machine learning to realize genetic site screening and genomic prediction of productive traits in pigs.

Affiliations

Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Laboratory of Swine Genetics and Breeding of Ministry of Agriculture, Huazhong Agricultural University, Wuhan, China.

College of Informatics, Huazhong Agricultural University, Wuhan, China.

Publication information

FASEB J. 2023 Jun;37(6):e22961. doi: 10.1096/fj.202300245R.

DOI:10.1096/fj.202300245R

PMID:37178007

Abstract

Genomic prediction based on solving linear mixed-model (LMM) equations is the most popular method for predicting breeding values or phenotypic performance for economic traits in livestock. To further improve prediction performance, nonlinear methods have been considered as a promising alternative. Machine learning (ML) approaches, which have developed rapidly, have shown an excellent ability to predict phenotypes in animal husbandry. To investigate the feasibility and reliability of genomic prediction with nonlinear models, we compared the performance of the linear genomic selection model and nonlinear ML models in predicting pig productive traits. Then, to reduce the high dimensionality of genome sequence data, different ML algorithms, including random forest (RF), support vector machine (SVM), extreme gradient boosting (XGBoost) and convolutional neural network (CNN), were used to perform genomic feature selection as well as genomic prediction on the reduced-feature genome data. All analyses were run on two real pig datasets: the published PIC pig dataset and a dataset from a national pig nucleus herd in Chifeng, North China. Overall, the accuracies of predicted phenotypic performance for traits T1, T2, T3 and T5 in the PIC dataset and for average daily gain (ADG) in the Chifeng dataset were higher with the ML methods than with the LMM method, while those for trait T4 in the PIC dataset and for total number of piglets born (TNB) in the Chifeng dataset were slightly lower with the ML methods than with the LMM method. Among the ML algorithms, SVM was the most appropriate for genomic prediction. In the genomic feature selection experiment, XGBoost combined with SVM gave the most stable and most accurate results across the algorithms tested. Through feature selection, the number of genomic markers can be reduced to one-twentieth of the original, while predictive performance on some traits can even improve relative to using the full genome data. Finally, we developed a new tool that runs the combined XGBoost and SVM algorithms to perform genomic feature selection and phenotypic prediction.
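
The two-stage idea described in the abstract (rank markers with a boosted-tree model, keep roughly one-twentieth of them, then fit an SVM on the reduced set) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' tool: the genotypes and phenotype are synthetic, the hyperparameters are arbitrary, accuracy is taken here to be the Pearson correlation between predicted and observed phenotypes, and scikit-learn's `GradientBoostingRegressor` is used as a stand-in when the `xgboost` package is not installed.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

try:
    from xgboost import XGBRegressor as Booster
except ImportError:
    # Stand-in (not XGBoost itself) so the sketch runs with scikit-learn only.
    from sklearn.ensemble import GradientBoostingRegressor as Booster

# Synthetic data (an assumption for illustration): SNP genotypes coded 0/1/2
# and a phenotype driven by a handful of causal markers plus noise.
rng = np.random.default_rng(0)
n_animals, n_snps, n_causal = 300, 400, 10
X = rng.integers(0, 3, size=(n_animals, n_snps)).astype(float)
effects = np.zeros(n_snps)
causal = rng.choice(n_snps, n_causal, replace=False)
effects[causal] = rng.normal(0.0, 1.0, n_causal)
y = X @ effects + rng.normal(0.0, 1.0, n_animals)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Stage 1: rank markers by boosted-tree importance, using training data only
# so that the held-out animals do not influence which markers are kept.
booster = Booster(n_estimators=100, max_depth=3, learning_rate=0.1, random_state=0)
booster.fit(X_tr, y_tr)
keep = n_snps // 20  # retain one-twentieth of the markers
selected = np.argsort(booster.feature_importances_)[::-1][:keep]

# Stage 2: fit a support vector regressor on the selected markers.
svr = SVR(kernel="rbf", C=10.0)
svr.fit(X_tr[:, selected], y_tr)
pred = svr.predict(X_te[:, selected])

# Predictive accuracy as the correlation of predicted and observed phenotypes.
accuracy = np.corrcoef(pred, y_te)[0, 1]
print(f"kept {keep} of {n_snps} markers, accuracy = {accuracy:.3f}")
```

Selecting markers inside the training split matters in practice: ranking on the full dataset before splitting would leak information from the validation animals and inflate the reported accuracy.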
