Feature Selection Stability and Accuracy of Prediction Models for Genomic Prediction of Residual Feed Intake in Pigs Using Machine Learning.

Authors

Piles Miriam, Bergsma Rob, Gianola Daniel, Gilbert Hélène, Tusell Llibertat

Affiliations

Animal Breeding and Genetics Program, Institute of Agriculture and Food Research and Technology (IRTA), Barcelona, Spain.

Topigs Norsvin Research Center, Beuningen, Netherlands.

Publication information

Front Genet. 2021 Feb 22;12:611506. doi: 10.3389/fgene.2021.611506. eCollection 2021.

DOI:10.3389/fgene.2021.611506

PMID:33692825

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7938892/

Abstract

Feature selection (FS, i.e., selection of a subset of predictor variables) is essential in high-dimensional datasets to prevent overfitting of prediction/classification models and to reduce computation time and resources. In genomics, FS makes it possible to identify relevant markers and to design low-density SNP chips for evaluating selection candidates. In this research, several univariate and multivariate FS algorithms, combined with various parametric and non-parametric learners, were applied to the prediction of feed efficiency in growing pigs from high-dimensional genomic data. The objective was to find the combination of feature selector, SNP subset size, and learner that leads to the most accurate and stable (i.e., least sensitive to changes in the training data) prediction models. Genomic best linear unbiased prediction (GBLUP) without SNP pre-selection was the benchmark. Three types of FS methods were implemented: (i) filter methods: univariate (univ.dtree, spearcor) or multivariate (cforest, mrmr), with random selection as benchmark; (ii) embedded methods: elastic net and least absolute shrinkage and selection operator (LASSO) regression; (iii) a combination of filter and embedded methods. Ridge regression, support vector machine (SVM), and gradient boosting (GB) were applied after pre-selection with the filter methods. The data comprised 5,708 individual records of residual feed intake, each to be predicted from the animal's own genotype. Accuracy and stability of results were measured as the median and interquartile range, respectively, of the Spearman correlation between observed and predicted data in a 10-fold cross-validation. The best prediction in terms of accuracy and stability was obtained with SVM and GB using 500 or more SNPs [median (interquartile range) of 0.28 (0.02) and 0.27 (0.04) for SVM and GB with 1,000 SNPs, respectively]. With larger subset sizes (1,000-1,500 SNPs), the choice of filter method had no influence on prediction quality, which was similar to that attained with a random selection. With 50-250 SNPs, the FS method had a strong impact on prediction quality: it was very poor for tree-based methods combined with any learner, but good and similar to what was obtained with larger SNP subsets when spearcor or mrmr were implemented with or without embedded methods. Those filters also led to very stable results, suggesting their potential use for designing low-density SNP chips for genome-based evaluation of feed efficiency.
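
The evaluation scheme described in the abstract (a univariate Spearman-correlation filter followed by a learner, scored by the median and interquartile range of the fold-wise Spearman correlation) can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, the learner is a deliberately simple sign-weighted allele count standing in for ridge regression, SVM, or GB, and repeating the filter inside each training fold is an assumption, since the abstract does not say where selection was performed.

```python
import random
import statistics


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def spearcor_filter(genotypes, phenotype, k):
    """Indices of the k SNPs with the largest |Spearman correlation|."""
    n_snps = len(genotypes[0])
    scores = [
        abs(spearman([row[j] for row in genotypes], phenotype))
        for j in range(n_snps)
    ]
    return sorted(range(n_snps), key=lambda j: scores[j], reverse=True)[:k]


def accuracy_and_stability(fold_correlations):
    """Median (accuracy) and interquartile range (stability) across folds."""
    q1, _, q3 = statistics.quantiles(fold_correlations, n=4)
    return statistics.median(fold_correlations), q3 - q1


def cross_validate(genotypes, phenotype, k_snps, n_folds=10, seed=0):
    """k-fold CV with the filter refitted on each training fold (assumption)."""
    idx = list(range(len(phenotype)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    correlations = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        x_train = [genotypes[i] for i in train]
        y_train = [phenotype[i] for i in train]
        selected = spearcor_filter(x_train, y_train, k_snps)
        # Toy learner (NOT ridge/SVM/GB): weight each selected SNP by the
        # sign of its training-set correlation and sum the allele counts.
        signs = {
            j: 1.0 if spearman([r[j] for r in x_train], y_train) >= 0 else -1.0
            for j in selected
        }
        predicted = [
            sum(signs[j] * genotypes[i][j] for j in selected) for i in test
        ]
        observed = [phenotype[i] for i in test]
        correlations.append(spearman(predicted, observed))
    return accuracy_and_stability(correlations)
```

Calling `cross_validate(genotypes, phenotype, k_snps=250)` on a list of genotype rows (allele counts 0/1/2) and a list of phenotypes returns a `(median, iqr)` pair comparable in form to the "0.28 (0.02)" figures quoted above. A smaller interquartile range means the result depends less on which animals fall in the training folds.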
