González-Recio Oscar, Weigel Kent A, Gianola Daniel, Naya Hugo, Rosa Guilherme J M
Departamento de Mejora Genética Animal, Instituto Nacional de Investigaciones Agrarias, Madrid 28040, Spain.
Genet Res (Camb). 2010 Jun;92(3):227-37. doi: 10.1017/S0016672310000261.
The L2-Boosting algorithm is one of the most promising machine-learning techniques to have appeared in recent decades. It can be applied to high-dimensional problems such as whole-genome studies, and it is relatively simple from a computational point of view. In this study, we used this algorithm in a genomic selection context to predict yet-to-be-observed outcomes. Two data sets were used: (1) productive lifetime predicted transmitting abilities from 4702 Holstein sires genotyped for 32 611 single nucleotide polymorphisms (SNPs) derived from the Illumina BovineSNP50 BeadChip, and (2) progeny averages of food conversion rate, pre-corrected for environmental and mate effects, in 394 broilers genotyped for 3481 SNPs. Each data set was split into a training set and a testing set, the latter comprising dairy or broiler sires whose ancestors were in the training set. Two weak learners, ordinary least squares (OLS) and non-parametric (NP) regression, were used within the L2-Boosting algorithm to provide a stringent evaluation of the procedure. The algorithm was compared with BL [Bayesian LASSO (least absolute shrinkage and selection operator)] and BayesA regression. Learning tasks were carried out in the training set, whereas the models were validated in the testing set. Pearson correlations between predicted and observed responses in the dairy cattle (broiler) data set were 0.65 (0.33), 0.53 (0.37), 0.66 (0.26) and 0.63 (0.27) for OLS-Boosting, NP-Boosting, BL and BayesA, respectively. The smallest bias and mean-squared error (MSE) were obtained with OLS-Boosting in both the dairy cattle (0.08 and 1.08, respectively) and broiler (-0.011 and 0.006, respectively) data sets. In the dairy cattle data set, BL was more accurate (bias=0.10 and MSE=1.10) than BayesA (bias=1.26 and MSE=2.81), whereas no differences between these two methods were found in the broiler data set.
L2-Boosting with a suitable learner was found to be a competitive alternative for genomic selection applications, providing high accuracy and low bias in genomic-assisted evaluations with a relatively short computational time.
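To make the procedure concrete, the following is a minimal sketch of L2-Boosting with an OLS weak learner, together with the three validation statistics reported above (Pearson correlation, bias and MSE). It is not the authors' implementation. It assumes a component-wise scheme in which each iteration fits the current residuals on the single best SNP and adds a shrunken version of that fit. The function names, the shrinkage factor `nu`, the fixed iteration count `n_iter`, the definition of bias as the mean of predicted minus observed, and the simulated genotypes are all illustrative assumptions.

```python
import numpy as np


def l2_boost_ols(X, y, n_iter=200, nu=0.1):
    """Component-wise L2-Boosting with single-SNP OLS as the weak learner.

    Assumed scheme (not taken from the paper): at each iteration, regress
    the current residuals on every centred SNP, keep the SNP that reduces
    the residual sum of squares most, and add nu times its OLS fit.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = X.mean(axis=0)
    Xc = X - x_mean                      # centred genotypes
    ss = (Xc ** 2).sum(axis=0)           # per-SNP sum of squares
    ss[ss == 0] = np.inf                 # monomorphic SNPs are never selected
    y_mean = y.mean()
    coef = np.zeros(X.shape[1])
    resid = y - y_mean                   # residuals of the initial (mean) fit
    for _ in range(n_iter):
        cov = Xc.T @ resid
        j = int(np.argmax(cov ** 2 / ss))  # largest drop in residual SS
        beta = cov[j] / ss[j]              # OLS slope of residuals on SNP j
        coef[j] += nu * beta               # shrunken update
        resid -= nu * beta * Xc[:, j]
    intercept = y_mean - x_mean @ coef     # intercept on the uncentred scale
    return intercept, coef


def predict(X, intercept, coef):
    """Predict responses from genotypes and fitted SNP effects."""
    return intercept + np.asarray(X, dtype=float) @ coef


def evaluate(y_obs, y_pred):
    """Return (Pearson r, bias, MSE); bias is assumed to be mean(pred - obs)."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r = np.corrcoef(y_obs, y_pred)[0, 1]
    bias = np.mean(y_pred - y_obs)
    mse = np.mean((y_pred - y_obs) ** 2)
    return r, bias, mse


if __name__ == "__main__":
    # Simulated data only: 400 animals, 200 SNPs coded 0/1/2, 10 causal SNPs.
    rng = np.random.default_rng(0)
    X = rng.binomial(2, 0.5, size=(400, 200)).astype(float)
    effects = np.zeros(200)
    effects[:10] = 1.0
    y = X @ effects + rng.normal(0.0, 1.0, size=400)

    # Learn in the training set, validate in the testing set.
    X_train, y_train = X[:300], y[:300]
    X_test, y_test = X[300:], y[300:]
    intercept, coef = l2_boost_ols(X_train, y_train)
    r, bias, mse = evaluate(y_test, predict(X_test, intercept, coef))
    print(f"r={r:.2f} bias={bias:.3f} mse={mse:.3f}")
```

In practice the number of iterations acts as the main tuning parameter and would typically be chosen by monitoring prediction error on held-out data rather than fixed in advance as it is here.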