Hendrix Genetics B.V., Research and Technology Center (RTC), 5830 AC Boxmeer, The Netherlands.
The Jackson Laboratory, Bar Harbor, ME 04609, USA.
G3 (Bethesda). 2022 Apr 4;12(4). doi: 10.1093/g3journal/jkac039.
We compared the performance of linear methods (GBLUP, BayesB, and elastic net) with that of a nonparametric tree-based ensemble method (gradient boosting machine) for genomic prediction of complex traits in mice. The dataset contained genotypes for 50,112 SNP markers and phenotypes for 835 animals from 6 generations. Traits analyzed were bone mineral density, body weight at 10, 15, and 20 weeks, fat percentage, circulating cholesterol, glucose, insulin, triglycerides, and urine creatinine. The youngest generation was used as the validation subset, and predictions were based on all older generations. Model performance was evaluated by comparing predictions for animals in the validation subset against their adjusted phenotypes. Linear models outperformed gradient boosting machine for 7 out of 10 traits. For bone mineral density, cholesterol, and glucose, the gradient boosting machine model showed better prediction accuracy and lower relative root mean squared error than the linear models. Interestingly, for these 3 traits, there is evidence that a relevant portion of the phenotypic variance is explained by epistatic effects. For some traits, fitting only a subset of top markers selected by a gradient boosting machine model improved prediction accuracy in both the linear and gradient boosting machine models. Our results indicate that gradient boosting machine is more strongly affected than the linear models by data size and by decreased connectedness between reference and validation sets. Although the linear models outperformed gradient boosting machine for the polygenic traits, our results suggest that gradient boosting machine is a competitive method for predicting complex traits with assumed epistatic effects.
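The validation scheme described above (youngest generation held out, all older generations used as the reference set) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy records, the field names, and the single-marker least-squares predictor, which stands in for the actual models and is not any of the methods compared in the study. The abstract does not define how relative root mean squared error was computed; here it is assumed to be RMSE divided by the standard deviation of the observed values.

```python
import math

# Toy records (hypothetical): one marker dosage (0/1/2) and one adjusted phenotype per animal.
animals = [
    {"generation": 1, "dosage": 0, "phenotype": 1.0},
    {"generation": 1, "dosage": 1, "phenotype": 2.1},
    {"generation": 1, "dosage": 2, "phenotype": 2.9},
    {"generation": 2, "dosage": 0, "phenotype": 1.2},
    {"generation": 2, "dosage": 1, "phenotype": 1.9},
    {"generation": 2, "dosage": 2, "phenotype": 3.1},
    {"generation": 3, "dosage": 0, "phenotype": 0.9},
    {"generation": 3, "dosage": 1, "phenotype": 2.2},
    {"generation": 3, "dosage": 2, "phenotype": 3.0},
]


def split_by_generation(records):
    """Hold out the youngest (highest-numbered) generation for validation."""
    youngest = max(r["generation"] for r in records)
    reference = [r for r in records if r["generation"] < youngest]
    validation = [r for r in records if r["generation"] == youngest]
    return reference, validation


def fit_single_marker(records):
    """Ordinary least squares on one marker: a stand-in predictor, not GBLUP/BayesB/GBM."""
    x = [r["dosage"] for r in records]
    y = [r["phenotype"] for r in records]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pearson(x, y):
    """Pearson correlation between predictions and observed phenotypes."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def relative_rmse(predicted, observed):
    """RMSE scaled by the SD of the observed values (assumed definition)."""
    n = len(observed)
    rmse = math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)
    mean_obs = sum(observed) / n
    sd_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in observed) / n)
    return rmse / sd_obs


reference, validation = split_by_generation(animals)
intercept, slope = fit_single_marker(reference)
predicted = [intercept + slope * r["dosage"] for r in validation]
observed = [r["phenotype"] for r in validation]

accuracy = pearson(predicted, observed)
rrmse = relative_rmse(predicted, observed)
print(f"accuracy={accuracy:.3f}, relative RMSE={rrmse:.3f}")
```

In a real analysis the stand-in predictor would be replaced by each model under comparison, with the same split and the same two metrics applied per trait.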