Department of Agronomy, National Taiwan University, Taipei, Taiwan.
Theor Appl Genet. 2019 Oct;132(10):2781-2792. doi: 10.1007/s00122-019-03387-0. Epub 2019 Jul 2.
A new optimality criterion, derived from Pearson's correlation between genomic estimated breeding values (GEBVs) and phenotypic values of a test set, is proposed for determining a training set for genomic selection (GS). R functions are provided to generate the optimal training set. For a specified test set, we develop a highly efficient algorithm that selects an optimal subset from a large candidate set whose individuals have been genotyped but not yet phenotyped. The chosen subset serves as the training set to be phenotyped, and a GS model is then built from its phenotype and genotype data. In this study, we consider the additive-effects whole-genome regression model and estimate marker effects in the GS model by ridge regression. The resulting GS model is then used to predict GEBVs for the individuals of the test set, which have been genotyped only. The new optimality criterion for determining the required training set is derived directly from Pearson's correlation between GEBVs and phenotypic values of the test set, which is the standard measure of the prediction accuracy of a GS model. The proposed methods can be applied to data with varying degrees of population structure. All the R functions implementing our training set determination algorithms are available in the R package TSDFGS. The algorithms are illustrated with two datasets, one with strong population structure (rice genome dataset) and one with mild population structure (wheat genome dataset). Our methods are shown to be advantageous over existing ones, mainly because they make full use of the genomic relationship between the test set and the training set by taking into account both the variance and the bias in predicting GEBVs.
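The workflow described in the abstract (select a training subset from genotyped candidates, fit a ridge regression on marker data, predict GEBVs for the test set, and score them by Pearson's correlation) can be illustrated with a minimal Python sketch. This is not the TSDFGS package and does not reproduce the paper's optimality criterion: the training set is drawn at random purely as a placeholder for the selection step, the data are simulated, and every name and value here (`ridge_marker_effects`, `predict_gebv`, `pearson`, `demo`, the penalty `lam`, the marker coding, the sample sizes) is a hypothetical choice made for illustration.

```python
import numpy as np


def ridge_marker_effects(X_train, y_train, lam):
    """Ridge estimate of additive marker effects.

    Solves (X'X + lam * I) beta = X'(y - mean(y)); the phenotype mean is
    kept separately as the intercept.
    """
    mu = float(np.mean(y_train))
    p = X_train.shape[1]
    A = X_train.T @ X_train + lam * np.eye(p)
    beta = np.linalg.solve(A, X_train.T @ (y_train - mu))
    return mu, beta


def predict_gebv(X_test, mu, beta):
    """GEBVs for genotyped-only individuals: intercept plus marker scores."""
    return mu + X_test @ beta


def pearson(a, b):
    """Pearson's correlation, the accuracy measure named in the abstract."""
    return float(np.corrcoef(a, b)[0, 1])


def demo(seed=0, n_cand=200, n_test=50, n_markers=100, n_train=60, lam=10.0):
    """Simulated end-to-end run (all sizes and the penalty are assumptions)."""
    rng = np.random.default_rng(seed)

    # Simulated genotypes coded 0/1/2 (an assumed coding, not from the source).
    X_cand = rng.integers(0, 3, size=(n_cand, n_markers)).astype(float)
    X_test = rng.integers(0, 3, size=(n_test, n_markers)).astype(float)
    true_effects = rng.normal(0.0, 1.0, size=n_markers)

    # Placeholder for the selection step: a RANDOM subset of the candidates.
    # The paper instead picks the subset that optimizes its new criterion.
    train_idx = rng.choice(n_cand, size=n_train, replace=False)
    X_train = X_cand[train_idx]

    # Only the chosen training individuals get "phenotyped".
    y_train = X_train @ true_effects + rng.normal(0.0, 3.0, size=n_train)
    y_test = X_test @ true_effects + rng.normal(0.0, 3.0, size=n_test)

    mu, beta = ridge_marker_effects(X_train, y_train, lam)
    gebv = predict_gebv(X_test, mu, beta)
    return pearson(gebv, y_test)


if __name__ == "__main__":
    print("Pearson correlation between GEBVs and test phenotypes:", demo())
```

In this sketch, swapping the random draw for a criterion-driven search is the only change needed to compare training set designs: the ridge fit and the correlation score stay the same, so differences in accuracy can be attributed to how the training set was chosen.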