Gianola Daniel, Weigel Kent A, Krämer Nicole, Stella Alessandra, Schön Chris-Carolin
Department of Animal Sciences, University of Wisconsin-Madison, Madison, Wisconsin, United States of America; Department of Dairy Science, University of Wisconsin-Madison, Madison, Wisconsin, United States of America; Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin, United States of America.
Department of Dairy Science, University of Wisconsin-Madison, Madison, Wisconsin, United States of America.
PLoS One. 2014 Apr 10;9(4):e91693. doi: 10.1371/journal.pone.0091693. eCollection 2014.
We examined whether the predictive ability of genomic best linear unbiased prediction (GBLUP) could be improved via a resampling method used in machine learning: bootstrap aggregating sampling ("bagging"). In theory, bagging can be useful when the predictor has large variance or when the number of markers is much larger than the sample size, preventing effective regularization. After a brief review of GBLUP, bagging was adapted to the GBLUP context, both at the level of the genetic signal and at the level of marker effects. The performance of bagging was evaluated with four simulated case studies including known or unknown quantitative trait loci, and the method was applied to real data on grain yield in wheat planted in four environments. A metric intended to quantify candidate-specific cross-validation uncertainty was proposed and assessed; as expected, model-derived theoretical reliabilities bore no relationship with cross-validation accuracy. Bagging was found to improve the predictive performance of GBLUP and to make it more robust against over-fitting. Seemingly, 25-50 bootstrap samples were enough to attain reasonable predictions as well as stable measures of individual predictive mean squared errors.
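The bagging idea at the level of the genetic signal can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes GBLUP is fitted as ridge-type regression on a genomic relationship matrix built from column-centered marker codes, and that the variance ratio `lam` (residual over genetic variance) is known. The names `gblup_predict`, `bagged_gblup`, `lam`, `n_boot`, and `seed` are hypothetical, and the spread returned across bootstrap samples is only an illustrative stand-in for a candidate-specific uncertainty measure, not the metric proposed in the paper.

```python
import numpy as np


def gblup_predict(X_train, y_train, X_test, lam=1.0):
    """GBLUP-style prediction for test individuals.

    X_train, X_test: marker matrices (individuals x markers).
    lam: assumed known ratio of residual to genetic variance.
    """
    p = X_train.shape[1]
    # Center markers with training-set column means.
    col_means = X_train.mean(axis=0)
    Z_train = X_train - col_means
    Z_test = X_test - col_means
    # Genomic relationship matrices (train x train, test x train).
    G_tt = Z_train @ Z_train.T / p
    G_st = Z_test @ Z_train.T / p
    mu = y_train.mean()
    # Solve (G + lam * I) alpha = y - mu, then project onto test individuals.
    alpha = np.linalg.solve(G_tt + lam * np.eye(len(y_train)), y_train - mu)
    return mu + G_st @ alpha


def bagged_gblup(X_train, y_train, X_test, n_boot=25, lam=1.0, seed=0):
    """Average GBLUP predictions over bootstrap samples of the training set.

    Returns the bagged prediction and the variance of the predictions
    across bootstrap samples, one value per test individual.
    """
    rng = np.random.default_rng(seed)
    n = len(y_train)
    preds = np.empty((n_boot, X_test.shape[0]))
    for b in range(n_boot):
        # Sample training individuals with replacement.
        idx = rng.integers(0, n, size=n)
        preds[b] = gblup_predict(X_train[idx], y_train[idx], X_test, lam)
    return preds.mean(axis=0), preds.var(axis=0)
```

Calling `bagged_gblup(X_train, y_train, X_test, n_boot=25)` returns one averaged prediction per candidate. The default of 25 bootstrap samples follows the lower end of the 25-50 range reported in the abstract; the value of `lam` would in practice be estimated from the data.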