Ogutu Joseph O, Piepho Hans-Peter, Schulz-Streeck Torben
Bioinformatics Unit, Institute of Crop Science, University of Hohenheim, Fruwirthstrasse 23, 70599 Stuttgart, Germany.
BMC Proc. 2011 May 27;5(Suppl 3):S11. doi: 10.1186/1753-6561-5-S3-S11.
Genomic selection (GS) involves estimating breeding values using molecular markers spanning the entire genome. Accurate prediction of genomic estimated breeding values (GEBVs) is a central challenge for contemporary plant and animal breeders. Because a wide array of marker-based approaches for predicting breeding values now exists, their relative predictive performance must be evaluated and compared to identify those that predict breeding values accurately. We evaluated the predictive accuracy of random forests (RF), stochastic gradient boosting (boosting) and support vector machines (SVMs) for predicting genomic breeding values using dense single nucleotide polymorphism (SNP) markers. We also explored the utility of RF for ranking the predictive importance of markers, either to pre-screen markers or to discover the chromosomal locations of quantitative trait loci (QTLs).
We predicted GEBVs for one quantitative trait in a dataset simulated for the QTLMAS 2010 workshop. Predictive accuracy was measured in two ways: as the Pearson correlation between GEBVs and observed values under 5-fold cross-validation, and as the Pearson correlation between predicted and true breeding values. The importance of each marker was ranked using RF and plotted against marker position, together with the positions of the associated QTLs, on one of the five simulated chromosomes.
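The two accuracy measures described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the toy genotypes, marker effects and noise level are invented, the helper names (`pearson`, `kfold_indices`, `cv_predictions`, `marker_regression`) are hypothetical, and the stand-in learner is a simple sum of single-marker regressions rather than RF, boosting or an SVM. It also assumes that out-of-fold predictions are pooled before the correlation is computed, which the abstract does not specify.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def kfold_indices(n, k=5, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_predictions(X, y, fit_predict, k=5, seed=0):
    """Return one out-of-fold prediction per individual."""
    preds = [0.0] * len(y)
    for test in kfold_indices(len(y), k, seed):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        fold = fit_predict([X[i] for i in train],
                           [y[i] for i in train],
                           [X[i] for i in test])
        for i, p in zip(test, fold):
            preds[i] = p
    return preds

def marker_regression(X_train, y_train, X_test):
    """Toy stand-in learner (assumption): sum of single-marker effects."""
    ybar = mean(y_train)
    effects, centers = [], []
    for j in range(len(X_train[0])):
        col = [row[j] for row in X_train]
        cbar = mean(col)
        sxx = sum((c - cbar) ** 2 for c in col)
        sxy = sum((c - cbar) * (v - ybar) for c, v in zip(col, y_train))
        effects.append(sxy / sxx if sxx > 0 else 0.0)
        centers.append(cbar)
    return [ybar + sum(e * (x - c) for e, x, c in zip(effects, row, centers))
            for row in X_test]

# Invented toy data: 200 individuals, 20 SNPs coded 0/1/2, 3 causal markers.
rng = random.Random(1)
X = [[rng.choice([0, 1, 2]) for _ in range(20)] for _ in range(200)]
true_bv = [1.0 * row[0] + 0.8 * row[5] - 0.6 * row[12] for row in X]
y = [g + rng.gauss(0.0, 1.0) for g in true_bv]

gebv = cv_predictions(X, y, marker_regression, k=5)
acc_observed = pearson(gebv, y)        # GEBVs vs observed values
acc_true = pearson(gebv, true_bv)      # GEBVs vs true breeding values
print(round(acc_observed, 3), round(acc_true, 3))
```

Any learner with the same `fit_predict(X_train, y_train, X_test)` shape could be substituted for `marker_regression`. True breeding values are only available here because the data are simulated, as they were for the QTLMAS 2010 dataset.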
The correlations between the predicted and true breeding values were 0.547 for boosting, 0.497 for SVMs, and 0.483 for RF, indicating better performance for boosting than for SVMs and RF.
Accuracy was highest for boosting, intermediate for SVMs and lowest for RF, but it differed little among the three methods and from that of ridge regression BLUP (RR-BLUP).