Institute of Plant Breeding, Seed Science, and Population Genetics, University of Hohenheim, 70593 Stuttgart, Germany.
Genetics. 2013 Jun;194(2):493-503. doi: 10.1534/genetics.113.150227. Epub 2013 Mar 27.
Intense structuring of plant breeding populations challenges the design of the training set (TS) in genomic selection (GS). An important open question is how the TS should be constructed from multiple related or unrelated small biparental families to predict progeny from individual crosses. Here, we used a set of five interconnected maize (Zea mays L.) populations of doubled-haploid (DH) lines derived from four parents to systematically investigate how the composition of the TS affects the prediction accuracy for lines from individual crosses. A total of 635 DH lines genotyped with 16,741 polymorphic SNPs were evaluated for five traits including Gibberella ear rot severity and three kernel yield component traits. The populations showed a genomic similarity pattern that reflects the crossing scheme, with a clear separation of full sibs, half sibs, and unrelated groups. Prediction accuracies within full-sib families of DH lines closely followed theoretical expectations, accounting for the influence of sample size and heritability of the trait. Prediction accuracies declined by 42% if full-sib DH lines were replaced by half-sib DH lines, but statistically significantly better results could be achieved if half-sib DH lines were available from both parents of the validation population instead of only one. Once both parents of the validation population were represented in the TS, including more crosses while keeping the TS size constant did not increase accuracies. Unrelated crosses showing opposite linkage phases with the validation population resulted in negative prediction accuracies if used alone and in reduced prediction accuracies if used in combination with related families. We suggest identifying and excluding such crosses from the TS. Moreover, the observed variability among populations and traits suggests that these uncertainties must be taken into account in models optimizing the allocation of resources in GS.
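The abstract does not state which theoretical expectation was used for the within-family accuracies. As an illustration only, one widely cited deterministic approximation (often attributed to Daetwyler and colleagues) expresses the expected accuracy as r = sqrt(N h^2 / (N h^2 + M_e)), where N is the TS size, h^2 the trait heritability, and M_e the effective number of independent chromosome segments. The minimal Python sketch below assumes that formula; the function name `expected_accuracy` and all input values are hypothetical and are not taken from the study.

```python
import math


def expected_accuracy(n_train: int, h2: float, m_e: float) -> float:
    """Expected correlation between predicted and true genetic values.

    Assumes the approximation r = sqrt(N*h2 / (N*h2 + Me)); this is an
    illustrative choice, not necessarily the expectation used in the study.
    """
    if n_train < 0:
        raise ValueError("n_train must be non-negative")
    if not 0.0 <= h2 <= 1.0:
        raise ValueError("h2 must lie in [0, 1]")
    if m_e <= 0:
        raise ValueError("m_e must be positive")
    nh2 = n_train * h2
    return math.sqrt(nh2 / (nh2 + m_e))


if __name__ == "__main__":
    # Hypothetical inputs: how accuracy responds to TS size and heritability
    # for a fixed, assumed Me of 50 independent segments.
    for n in (25, 50, 100):
        for h2 in (0.3, 0.6, 0.9):
            r = expected_accuracy(n, h2, m_e=50.0)
            print(f"N={n:3d}  h2={h2:.1f}  expected r={r:.2f}")
```

Under this approximation, accuracy rises with both TS size and heritability and approaches 1 only as N h^2 becomes large relative to M_e, which is the qualitative dependence on sample size and heritability that the abstract refers to.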