Sparse kernel models provide optimization of training set design for genomic prediction in multiyear wheat breeding data.

Authors

Lopez-Cruz Marco, Dreisigacker Susanne, Crespo-Herrera Leonardo, Bentley Alison R, Singh Ravi, Poland Jesse, Shrestha Sandesh, Huerta-Espino Julio, Govindan Velu, Juliana Philomin, Mondal Suchismita, Pérez-Rodríguez Paulino, Crossa Jose

Affiliations

Dep. of Epidemiology and Biostatistics, Michigan State Univ., East Lansing, MI, USA.

Global Wheat Program, International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Mexico.

Publication information

Plant Genome. 2022 Dec;15(4):e20254. doi: 10.1002/tpg2.20254. Epub 2022 Aug 31.

DOI:10.1002/tpg2.20254

PMID:36043341

Abstract

The success of genomic selection (GS) in breeding schemes relies on its ability to provide accurate predictions of unobserved lines at early stages. Multigeneration data provide opportunities to increase the training data size and thus the likelihood of extracting useful information from ancestors to improve prediction accuracy. Genomic best linear unbiased predictions (GBLUPs) are obtained by borrowing information through kinship relationships between individuals. Multigeneration data usually become heterogeneous, with complex family relationship patterns that are increasingly entangled with each generation. Under these conditions, historical data may not be optimal for model training, as accuracy could be compromised. The sparse selection index (SSI) is a method for training set (TRN) optimization in which training individuals provide predictions to some but not all predicted subjects. We added a trimming process to the original SSI (the trimmed SSI) to remove training individuals that are less important for prediction. Using a large multigeneration (8 yr) wheat (Triticum aestivum L.) grain yield dataset (n = 68,836), we found increases in accuracy as more years are included in the TRN, with improvements of ∼0.05 in the GBLUP accuracy when using 5 yr of historical data relative to using only 1 yr. The SSI method showed a small gain over the GBLUP accuracy but with an important reduction in the TRN size. These reduced TRNs were formed with a similar number of subjects from each training generation. Our results suggest that the SSI provides a more stable ranking of genotypes than the GBLUP as the TRN becomes larger.
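The contrast between GBLUP and a sparse index can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the weights a predicted individual places on training phenotypes minimize 0.5 b'Pb - g'b + penalty * |b|_1, where P = G_trn + ((1 - h2) / h2) I, G_trn is the training kinship matrix, g holds the kinships between the predicted individual and the training set, and h2 is the heritability. The function names, the coordinate-descent solver, and the `penalty` parameter are hypothetical choices made for illustration. With `penalty = 0` the weights reduce to the usual GBLUP weights. The additional trimming step described in the abstract is not shown.

```python
import numpy as np


def gblup_weights(G_trn, g_tst, h2):
    """Dense weights on training phenotypes for one predicted individual."""
    lam = (1.0 - h2) / h2  # ratio of residual to genetic variance
    P = G_trn + lam * np.eye(G_trn.shape[0])
    return np.linalg.solve(P, g_tst)


def sparse_weights(G_trn, g_tst, h2, penalty, n_iter=500, tol=1e-10):
    """L1-penalized weights via coordinate descent (illustrative sketch).

    Minimizes 0.5 * b'Pb - g'b + penalty * sum(|b|), with P as in GBLUP.
    Training individuals whose weight is exactly zero do not contribute
    to the prediction of this individual.
    """
    n = G_trn.shape[0]
    lam = (1.0 - h2) / h2
    P = G_trn + lam * np.eye(n)
    b = np.zeros(n)
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(n):
            # Kinship signal left for j after removing the other weights
            r = g_tst[j] - P[j] @ b + P[j, j] * b[j]
            # Soft-thresholding sets small contributions exactly to zero
            new = np.sign(r) * max(abs(r) - penalty, 0.0) / P[j, j]
            max_change = max(max_change, abs(new - b[j]))
            b[j] = new
        if max_change < tol:
            break
    return b


def predict(weights, y_trn):
    """Predicted genetic value: weighted sum of centered training phenotypes."""
    return float(weights @ (y_trn - y_trn.mean()))
```

In this sketch, the set of training individuals with nonzero weight plays the role of an individual-specific training set: it shrinks as `penalty` grows and differs from one predicted individual to another. In practice the penalty would be tuned, for example by cross-validation within the training data.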
