Department of Plant, Soil and Microbial Sciences, Michigan State University, East Lansing, MI, USA.
Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI, USA.
Heredity (Edinb). 2021 Nov;127(5):423-432. doi: 10.1038/s41437-021-00474-1. Epub 2021 Sep 25.
Genomic prediction models are often calibrated using multi-generation data. As data accumulate over time, training data sets become increasingly heterogeneous. Differences in allele frequency and linkage disequilibrium patterns between the training and prediction genotypes may limit prediction accuracy. This raises the question of whether all available data or only a subset should be used to calibrate genomic prediction models. Previous research on training set optimization has focused on identifying a subset of the available data that is optimal for a given prediction set. However, this approach does not consider the possibility that different training sets may be optimal for different prediction genotypes. To address this problem, we recently introduced a sparse selection index (SSI) that identifies an optimal training set for each individual in a prediction set. Using additive genomic relationships, the SSI can provide increased accuracy relative to genomic BLUP (GBLUP). Non-parametric genomic models using Gaussian kernels (KBLUP) have, in some cases, yielded higher prediction accuracies than standard additive models. Therefore, here we studied whether combining SSIs and kernel methods could further improve prediction accuracy when training genomic models using multi-generation data. Using four years of doubled haploid maize data from the International Maize and Wheat Improvement Center (CIMMYT), we found that, when predicting grain yield, the KBLUP outperformed the GBLUP, and that using the SSI with additive relationships (GSSI) led to increases in accuracy of 5-17% relative to the GBLUP. However, differences in prediction accuracy between the KBLUP and the kernel-based SSI were smaller and not always significant.
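The relationship between the four predictors named above (GBLUP, KBLUP, and their sparse-index counterparts) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the SSI weights for each prediction individual minimize an L1-penalized selection-index objective, 0.5 b'(K + ratio I)b - k'b + l1 |b|_1, where K is either an additive relationship matrix or a Gaussian kernel and ratio stands for the residual-to-genetic variance ratio; with l1 = 0 the weights reduce to the usual BLUP weights. All function names, the variance ratio, the penalty value, the bandwidth and the simulated markers are hypothetical choices made for illustration only.

```python
import numpy as np

def additive_grm(X):
    """Additive relationship matrix from an n x p marker matrix coded 0/1/2."""
    p = X.mean(axis=0) / 2.0                      # estimated allele frequencies
    Z = X - 2.0 * p                               # centered genotypes
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

def gaussian_kernel(X, bandwidth=1.0):
    """Gaussian kernel exp(-h * d_ij^2 / mean(d^2)) on squared Euclidean distances."""
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    mean_d2 = D2[~np.eye(len(X), dtype=bool)].mean()
    return np.exp(-bandwidth * D2 / mean_d2)

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def ssi_weights(P, c, l1, n_iter=500, tol=1e-8):
    """Coordinate descent for min_b 0.5 b'Pb - c'b + l1 * sum|b_j| (P positive definite)."""
    b = np.zeros_like(c)
    for _ in range(n_iter):
        max_delta = 0.0
        for j in range(len(c)):
            r = c[j] - P[j] @ b + P[j, j] * b[j]  # partial residual for coordinate j
            new = soft_threshold(r, l1) / P[j, j]
            max_delta = max(max_delta, abs(new - b[j]))
            b[j] = new
        if max_delta < tol:
            break
    return b

def predict(K, y, trn, tst, ratio=1.0, l1=0.0):
    """Predict each test individual from its own (possibly sparse) set of training weights.

    ratio is an assumed residual-to-genetic variance ratio; l1=0 gives the dense BLUP.
    Returns the predictions and the number of training individuals with nonzero weight.
    """
    P = K[np.ix_(trn, trn)] + ratio * np.eye(len(trn))
    mu = y[trn].mean()
    preds, support = [], []
    for i in tst:
        b = ssi_weights(P, K[trn, i], l1)
        preds.append(mu + b @ (y[trn] - mu))
        support.append(np.count_nonzero(b))
    return np.array(preds), np.array(support)

# Simulated example (not the CIMMYT data): 60 lines, 200 markers.
rng = np.random.default_rng(0)
X = rng.binomial(2, 0.3, size=(60, 200)).astype(float)
y = X @ rng.normal(0.0, 0.1, size=200) + rng.normal(0.0, 1.0, size=60)
trn, tst = np.arange(45), np.arange(45, 60)

for name, K in [("additive", additive_grm(X)), ("Gaussian", gaussian_kernel(X))]:
    for l1 in (0.0, 0.05):
        yhat, support = predict(K, y, trn, tst, ratio=1.0, l1=l1)
        acc = np.corrcoef(yhat, y[tst])[0, 1]
        print(f"{name:9s} l1={l1:.2f}  accuracy={acc:.3f}  mean support={support.mean():.1f}")
```

In this sketch the penalty l1 controls how many training individuals contribute to each prediction. In practice the penalty, the variance ratio and the kernel bandwidth would be estimated or tuned (for example by cross-validation within the training set) rather than fixed as they are here.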