Sensitivity to prior specification in Bayesian genome-based prediction models.

Author information

Lehermeier Christina, Wimmer Valentin, Albrecht Theresa, Auinger Hans-Jürgen, Gianola Daniel, Schmid Volker J, Schön Chris-Carolin

Affiliation

Plant Breeding, Technische Universität München, Emil-Ramann-Straße 4, 85354 Freising, Germany.

Publication information

Stat Appl Genet Mol Biol. 2013 Jun;12(3):375-91. doi: 10.1515/sagmb-2012-0042.

DOI:10.1515/sagmb-2012-0042

PMID:23629460

Abstract

Different statistical models have been proposed for maximizing prediction accuracy in genome-based prediction of breeding values in plant and animal breeding. However, little is known about the sensitivity of these models with respect to prior and hyperparameter specification, because comparisons of prediction performance are mainly based on a single set of hyperparameters. In this study, we focused on Bayesian prediction methods using a standard linear regression model with marker covariates coding additive effects at a large number of marker loci. By comparing different hyperparameter settings, we investigated the sensitivity of four methods frequently used in genome-based prediction (Bayesian Ridge, Bayesian Lasso, BayesA and BayesB) to specification of the prior distribution of marker effects. We used datasets simulated according to a typical maize breeding program differing in the number of markers and the number of simulated quantitative trait loci affecting the trait. Furthermore, we used an experimental maize dataset, comprising 698 doubled haploid lines, each genotyped with 56110 single nucleotide polymorphism markers and phenotyped as testcrosses for the two quantitative traits grain dry matter yield and grain dry matter content. The predictive ability of the different models was assessed by five-fold cross-validation. The extent of Bayesian learning was quantified by calculation of the Hellinger distance between the prior and posterior densities of marker effects. Our results indicate that similar predictive abilities can be achieved with all methods, but with BayesA and BayesB hyperparameter settings had a stronger effect on prediction performance than with the other two methods. Prediction performance of BayesA and BayesB suffered substantially from a non-optimal choice of hyperparameters.
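The abstract quantifies Bayesian learning by the Hellinger distance between the prior and posterior densities of marker effects, but does not say how that distance was computed. As a minimal sketch only, the Python below uses one common definition, H = sqrt(1 - BC), where BC is the integral of sqrt(p(x) q(x)), and evaluates it numerically on a grid. The function names, the normal densities, and the example prior and posterior parameters are all hypothetical and chosen for illustration; they are not taken from the study.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def hellinger_grid(p, q, lo, hi, n=20001):
    """Hellinger distance between densities p and q, integrated on [lo, hi].

    Uses the trapezoid rule to approximate the Bhattacharyya coefficient
    BC = integral of sqrt(p(x) * q(x)) dx, then returns sqrt(1 - BC).
    """
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end-point weights
        total += weight * math.sqrt(p(x) * q(x))
    bc = total * step
    return math.sqrt(max(0.0, 1.0 - bc))  # clamp guards against rounding below zero


def hellinger_normal(m1, s1, m2, s2):
    """Closed-form Hellinger distance between N(m1, s1^2) and N(m2, s2^2)."""
    var_sum = s1 ** 2 + s2 ** 2
    bc = math.sqrt(2.0 * s1 * s2 / var_sum) * math.exp(-((m1 - m2) ** 2) / (4.0 * var_sum))
    return math.sqrt(max(0.0, 1.0 - bc))


# Hypothetical example: a N(0, 1) prior on a marker effect and a narrower,
# slightly shifted N(0.1, 0.5^2) posterior. These values are illustrative only.
prior = lambda x: normal_pdf(x, 0.0, 1.0)
posterior = lambda x: normal_pdf(x, 0.1, 0.5)

h_numeric = hellinger_grid(prior, posterior, -10.0, 10.0)
h_exact = hellinger_normal(0.0, 1.0, 0.1, 0.5)
print(f"numeric: {h_numeric:.4f}, closed form: {h_exact:.4f}")
```

Under this definition the distance lies between 0 and 1. A value near 0 means the posterior is almost the same as the prior, so the data contributed little learning about that marker effect; larger values mean the data moved the posterior further from the prior. In practice the posterior density would have to be estimated from MCMC samples (for example by kernel density estimation) before applying the grid version.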
