Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223, Madrid, Spain.
CIBMTR (Center for International Blood and Marrow Transplant Research), National Marrow Donor Program/Be The Match, Minneapolis, USA.
Theor Appl Genet. 2023 Mar 9;136(3):30. doi: 10.1007/s00122-023-04265-6.
Maximizing CDmean and Avg_GRM_self were the best criteria for training set optimization. A training set size of 50-55% (targeted) or 65-85% (untargeted) of the candidate set is needed to obtain 95% of the maximum accuracy.

With the advent of genomic selection (GS) as a widespread breeding tool, mechanisms to efficiently design an optimal training set for GS models have become more relevant, since they allow maximizing the accuracy while minimizing the phenotyping costs. The literature describes many training set optimization methods, but a comprehensive comparison among them is lacking. This work aimed to provide an extensive benchmark of optimization methods and optimal training set sizes, in order to offer guidelines for their application in breeding programs. A wide range of methods and sizes was tested in seven datasets covering six species, different genetic architectures, population structures, and heritabilities, and with several GS models. Our results showed that targeted optimization (which uses information from the test set) performed better than untargeted optimization (which does not use test set data), especially when heritability was low. The mean coefficient of determination (CDmean) was the best targeted method, although it was computationally intensive. Minimizing the average relationship within the training set was the best strategy for untargeted optimization. Regarding the optimal training set size, maximum accuracy was obtained when the training set was the entire candidate set. Nevertheless, 50-55% of the candidate set was enough to reach 95-100% of the maximum accuracy in the targeted scenario, while 65-85% was needed for untargeted optimization. Our results also suggested that a diverse training set makes GS robust against population structure, while including clustering information was less effective. The choice of the GS model did not have a significant influence on the prediction accuracies.
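As a rough illustration of the untargeted strategy named above (minimizing the average relationship within the training set), the following minimal Python sketch may help. Everything in it is an assumption made for illustration rather than the authors' implementation: the function names, the simulated 0/1/2 marker matrix, the VanRaden-style relationship matrix, the inclusion of the diagonal in the average, and the simple random-swap local search used in place of whatever optimizer the study employed.

```python
import numpy as np

def vanraden_grm(markers):
    """Genomic relationship matrix from an (n_lines x n_markers) 0/1/2 matrix.
    Assumes at least one polymorphic marker so the denominator is non-zero."""
    p = markers.mean(axis=0) / 2.0           # allele frequency per marker
    z = markers - 2.0 * p                    # centre each marker column
    return z @ z.T / (2.0 * np.sum(p * (1.0 - p)))

def avg_relationship(grm, idx):
    """Mean of the GRM block for the chosen lines (diagonal included; an assumption)."""
    return grm[np.ix_(idx, idx)].mean()

def select_training_set(grm, n_train, n_iter=2000, seed=0):
    """Random-swap local search: accept a swap only if it lowers the
    average relationship within the training set."""
    rng = np.random.default_rng(seed)
    n = grm.shape[0]
    train = rng.choice(n, size=n_train, replace=False)   # random starting set
    best = avg_relationship(grm, train)
    for _ in range(n_iter):
        in_set = np.zeros(n, dtype=bool)
        in_set[train] = True
        out = np.flatnonzero(~in_set)                    # lines not yet selected
        if out.size == 0:
            break
        candidate = train.copy()
        candidate[rng.integers(n_train)] = rng.choice(out)  # swap one line
        score = avg_relationship(grm, candidate)
        if score < best:
            train, best = candidate, score
    return np.sort(train), best

# Simulated candidate set: 40 lines, 200 markers (hypothetical data).
demo_rng = np.random.default_rng(1)
markers = demo_rng.integers(0, 3, size=(40, 200)).astype(float)
grm = vanraden_grm(markers)
train, score = select_training_set(grm, n_train=20)   # 50% of the candidates
print(train, score)
```

The swap search is only one cheap way to explore subsets. A targeted criterion such as CDmean would additionally need the relationships between training and test lines, which this sketch deliberately leaves out.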