Comparing Genomic Prediction Models by Means of Cross Validation.

Authors

Schrauf Matías F, de Los Campos Gustavo, Munilla Sebastián

Affiliations

Facultad de Agronomía, Universidad de Buenos Aires, Buenos Aires, Argentina.

Animal Breeding & Genomics, Wageningen Livestock Research, Wageningen University & Research, Wageningen, Netherlands.

Publication information

Front Plant Sci. 2021 Nov 19;12:734512. doi: 10.3389/fpls.2021.734512. eCollection 2021.

DOI:10.3389/fpls.2021.734512

PMID:34868117

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8639521/

Abstract

In the two decades of continuous development of genomic selection, a great variety of models have been proposed to make predictions from the information available in dense marker panels. Besides deciding which particular model to use, practitioners also need to make many minor choices for those parameters in the model which are not typically estimated by the data (so-called "hyper-parameters"). When the focus is placed on predictions, most of these decisions are made in a direction sought to optimize predictive accuracy. Here we discuss and illustrate, using publicly available crop datasets, the use of cross validation to make many such decisions. In particular, we emphasize the importance of paired comparisons to achieve high power in the comparison between candidate models, as well as the need to define notions of relevance in the difference between their performances. Regarding the latter, we borrow the idea of equivalence margins from clinical research and introduce new statistical tests. We conclude that most hyper-parameters can be learnt from the data by either minimizing REML or by using weakly-informative priors, with good predictive results. In particular, the default options in a popular software package are generally competitive with the optimal values. With regard to the performance assessments themselves, we conclude that the paired k-fold cross validation is a generally applicable and statistically powerful methodology to assess differences in model accuracies. Coupled with the definition of equivalence margins based on expected genetic gain, it becomes a useful tool for breeders.
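
The paired comparison described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the authors' procedure: the simulated data, the two single-covariate regression models, the margin of 0.05 on the correlation scale, and the t-based confidence interval on fold-wise differences (the paper introduces its own tests and bases its margins on expected genetic gain). Fold-wise accuracies from k-fold cross validation are not independent, so the interval should be read as approximate.

```python
import math
import random
from statistics import mean, stdev


def kfold_partition(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def make_simple_regression(x, y):
    """Hypothetical stand-in for a prediction model: y regressed on one covariate."""
    def fit_predict(train, test):
        mx = mean(x[i] for i in train)
        my = mean(y[i] for i in train)
        sxy = sum((x[i] - mx) * (y[i] - my) for i in train)
        sxx = sum((x[i] - mx) ** 2 for i in train)
        slope = sxy / sxx
        return [my + slope * (x[i] - mx) for i in test]
    return fit_predict


def paired_cv(models, y, folds):
    """Per-fold accuracy of every model, evaluated on the SAME folds."""
    acc = {name: [] for name in models}
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        observed = [y[i] for i in test]
        for name, fit_predict in models.items():
            acc[name].append(pearson(fit_predict(train, test), observed))
    return acc


def paired_difference(acc_a, acc_b, t_crit):
    """Mean fold-wise difference (A - B) and its t-based confidence interval."""
    d = [a - b for a, b in zip(acc_a, acc_b)]
    d_bar = mean(d)
    se = stdev(d) / math.sqrt(len(d))
    return d_bar, (d_bar - t_crit * se, d_bar + t_crit * se)


def classify_difference(ci, margin):
    """Compare the interval with the equivalence margin (-margin, +margin)."""
    lo, hi = ci
    if -margin < lo and hi < margin:
        return "equivalent"
    if lo > margin:
        return "A better"
    if hi < -margin:
        return "B better"
    return "inconclusive"


# Simulated data (assumption): x1 is informative, x2 is a noisier proxy of x1.
rng = random.Random(1)
n, k = 200, 10
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [v + rng.gauss(0, 1) for v in x1]
y = [v + rng.gauss(0, 1) for v in x1]

models = {"A": make_simple_regression(x1, y), "B": make_simple_regression(x2, y)}
folds = kfold_partition(n, k, seed=2)
acc = paired_cv(models, y, folds)

T_CRIT_90_DF9 = 1.833  # 95th percentile of Student's t with k - 1 = 9 df
d_bar, ci = paired_difference(acc["A"], acc["B"], T_CRIT_90_DF9)
print(d_bar, ci, classify_difference(ci, margin=0.05))
```

Both models are scored on identical folds, so the comparison works on fold-wise differences rather than on two independent sets of accuracies. Differencing removes the fold-to-fold variation that the two models share, which is the source of the power gain the abstract attributes to paired comparisons.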

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2ce0/8639521/cdf794b7afe0/fpls-12-734512-g0001.jpg
