Department of Plant, Soil and Microbial Sciences, Michigan State University, East Lansing, MI 48824, USA.
Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA.
Genetics. 2021 May 17;218(1). doi: 10.1093/genetics/iyab030.
Genomic prediction uses DNA sequences and phenotypes to predict genetic values. In homogeneous populations, theory indicates that the accuracy of genomic prediction increases with sample size. However, differences in allele frequencies and linkage disequilibrium patterns can lead to heterogeneity in SNP effects. In this context, calibrating genomic predictions using a large, potentially heterogeneous training data set may not lead to optimal prediction accuracy. Some studies have tried to address this trade-off between sample size and homogeneity using training set optimization algorithms; however, this approach assumes that a single training data set is optimal for all individuals in the prediction set. Here, we propose an approach that identifies, for each individual in the prediction set, a subset of the training data (i.e., a set of support points) from which predictions are derived. The proposed methodology is a sparse selection index (SSI) that integrates selection index methodology with sparsity-inducing techniques commonly used for high-dimensional regression. The sparsity of the resulting index is controlled by a regularization parameter (λ); the G-Best Linear Unbiased Predictor (G-BLUP), the prediction method most commonly used in plant and animal breeding, arises as the special case λ = 0. In this study, we present the methodology and demonstrate, using two wheat data sets with phenotypes collected in 10 different environments, that the SSI can achieve significant gains in prediction accuracy (between 5 and 10%) relative to the G-BLUP.
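The idea of an individual-specific sparse index can be illustrated with a minimal sketch. The abstract does not give formulas, so the following is an assumption: for each prediction individual i, the index weights β_i are taken to minimize (1/2) β'(G_trn + θI)β - g_i'β + λ·Σ|β_j|, where G_trn is the genomic relationship matrix among training individuals, g_i holds the relationships between individual i and the training set, and θ is the ratio of residual to genetic variance (treated here as known). With λ = 0 this reduces to the G-BLUP weights (G_trn + θI)^(-1) g_i. The function names, the simulated data, the value of θ, and the coordinate-descent solver are all hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np

def soft_threshold(x, lam):
    """Soft-thresholding operator used by L1-penalized coordinate descent."""
    return np.sign(x) * max(abs(x) - lam, 0.0)

def sparse_index_weights(P, c, lam, n_iter=500, tol=1e-10):
    """Minimize 0.5*b'Pb - c'b + lam*sum(|b|) by cyclic coordinate descent.

    P is assumed symmetric positive definite (here G_trn + theta*I).
    """
    beta = np.zeros(len(c))
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(len(c)):
            # Partial residual: c_j minus contributions of all other weights.
            r = c[j] - P[j] @ beta + P[j, j] * beta[j]
            new = soft_threshold(r, lam) / P[j, j]
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return beta

# --- Simulated toy data (all values below are illustrative assumptions) ---
rng = np.random.default_rng(0)
n, p = 40, 200
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)   # SNP genotypes 0/1/2
Z = X - X.mean(axis=0)                                # centered markers
G = Z @ Z.T / p
G /= np.mean(np.diag(G))                              # scale to average diagonal 1

trn = np.arange(30)          # training individuals
i = 35                       # one individual from the prediction set
theta = 1.0                  # assumed residual-to-genetic variance ratio

y = Z @ rng.normal(0, 0.1, size=p) + rng.normal(0, 1.0, size=n)
y_trn = y[trn] - y[trn].mean()                        # centered training phenotypes

P = G[np.ix_(trn, trn)] + theta * np.eye(len(trn))
c = G[trn, i]

beta_gblup = np.linalg.solve(P, c)                    # closed-form G-BLUP weights
beta_lam0 = sparse_index_weights(P, c, lam=0.0)       # SSI with lambda = 0
beta_sparse = sparse_index_weights(P, c, lam=0.5 * np.max(np.abs(c)))
beta_null = sparse_index_weights(P, c, lam=np.max(np.abs(c)))

# Predicted genetic value for individual i under each set of weights.
u_hat_gblup = beta_gblup @ y_trn
u_hat_sparse = beta_sparse @ y_trn
n_support = np.count_nonzero(beta_sparse)             # size of the support set
```

In this sketch, `n_support` is the number of training individuals that actually contribute to the prediction for individual i. Repeating the fit for each individual in the prediction set gives each one its own support set, and in practice λ would be chosen by some form of cross-validation rather than fixed as it is here.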