Comparing Genomic Prediction Models by Means of Cross Validation.

Authors

Schrauf Matías F, de Los Campos Gustavo, Munilla Sebastián

Affiliations

Facultad de Agronomía, Universidad de Buenos Aires, Buenos Aires, Argentina.

Animal Breeding & Genomics, Wageningen Livestock Research, Wageningen University & Research, Wageningen, Netherlands.

Publication information

Front Plant Sci. 2021 Nov 19;12:734512. doi: 10.3389/fpls.2021.734512. eCollection 2021.

Abstract

In the two decades of continuous development of genomic selection, a great variety of models have been proposed to make predictions from the information available in dense marker panels. Besides deciding which particular model to use, practitioners also need to make many minor choices for those parameters in the model that are not typically estimated from the data (so-called "hyper-parameters"). When the focus is placed on predictions, most of these decisions are made with the aim of optimizing predictive accuracy. Here we discuss, and illustrate with publicly available crop datasets, the use of cross validation to make many such decisions. In particular, we emphasize the importance of paired comparisons to achieve high power in the comparison between candidate models, as well as the need to define what counts as a relevant difference between their performances. Regarding the latter, we borrow the idea of equivalence margins from clinical research and introduce new statistical tests. We conclude that most hyper-parameters can be learnt from the data, either by REML estimation or by using weakly-informative priors, with good predictive results. In particular, the default options in a popular software package are generally competitive with the optimal values. With regard to the performance assessments themselves, we conclude that paired k-fold cross validation is a generally applicable and statistically powerful methodology to assess differences in model accuracies. Coupled with the definition of equivalence margins based on expected genetic gain, it becomes a useful tool for breeders.
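The paired design described in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' code and not the statistical tests they introduce. It assumes simulated marker data, ridge regression as the prediction model, two arbitrary penalty values (`lam_a`, `lam_b`) as the candidate hyper-parameters, and a hypothetical equivalence margin of 0.05 on the accuracy scale. Both candidates are evaluated on exactly the same folds, so the comparison is made on fold-wise differences rather than on two independent sets of accuracies.

```python
import numpy as np


def ridge_predict(X_train, y_train, X_test, lam):
    """Fit ridge regression on the training fold and predict the test fold."""
    col_means = X_train.mean(axis=0)          # center markers with training means only
    Xc = X_train - col_means
    mu = y_train.mean()
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y_train - mu))
    return mu + (X_test - col_means) @ beta


def paired_kfold(X, y, lam_a, lam_b, k=10, seed=0):
    """Return fold-wise accuracies (correlations) of two models on identical folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    acc_a, acc_b = [], []
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        for lam, acc in ((lam_a, acc_a), (lam_b, acc_b)):
            pred = ridge_predict(X[train], y[train], X[test_idx], lam)
            acc.append(np.corrcoef(pred, y[test_idx])[0, 1])
    return np.array(acc_a), np.array(acc_b)


def mean_difference_interval(diffs, t_crit):
    """Approximate interval for the mean paired difference (t-based)."""
    mean = diffs.mean()
    se = diffs.std(ddof=1) / np.sqrt(len(diffs))
    return mean - t_crit * se, mean + t_crit * se


# Simulated data (assumption): 200 lines, 300 biallelic markers coded 0/1/2.
rng = np.random.default_rng(1)
n, p = 200, 300
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
y = X @ rng.normal(0.0, 0.1, p) + rng.normal(0.0, 1.0, n)

acc_a, acc_b = paired_kfold(X, y, lam_a=50.0, lam_b=500.0, k=10)
diffs = acc_a - acc_b                          # paired, fold-by-fold differences

# 1.833 is the 0.95 quantile of Student's t with 9 degrees of freedom (10 folds).
low, high = mean_difference_interval(diffs, t_crit=1.833)

margin = 0.05                                  # hypothetical equivalence margin
equivalent = (-margin < low) and (high < margin)
print(f"mean difference: {diffs.mean():.3f}, interval: ({low:.3f}, {high:.3f})")
print("within equivalence margin" if equivalent else "not shown to be equivalent")
```

Fold-wise accuracies share training data across folds, so the differences are not independent and the t-based interval is only approximate. In practice the margin would be chosen from a quantity that matters to the breeder, such as expected genetic gain, rather than fixed arbitrarily as it is here.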

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2ce0/8639521/cdf794b7afe0/fpls-12-734512-g0001.jpg
