Department of Animal and Dairy Science, University of Georgia, Athens, GA, 30602, USA.
Genet Sel Evol. 2023 Jul 17;55(1):49. doi: 10.1186/s12711-023-00823-0.
Identifying true positive variants in genome-wide associations (GWA) depends on several factors, including the number of genotyped individuals. The limited dimensionality of genomic information may give insight into the optimal number of individuals to use in GWA. This study investigated different discovery set sizes based on the number of largest eigenvalues explaining a certain proportion of variance in the genomic relationship matrix (G). In addition, we investigated how prediction accuracy changed when variants selected using the different set sizes were added to the regular single nucleotide polymorphism (SNP) chips used for genomic prediction.
We simulated sequence data that included 500k SNPs with 200 or 2000 quantitative trait nucleotides (QTN). A regular 50k panel included one in every ten simulated SNPs. Effective population size (Ne) was set to 20 or 200. GWA were performed using a number of genotyped animals equivalent to the number of largest eigenvalues of G (EIG) explaining 50, 60, 70, 80, 90, 95, 98, and 99% of the variance. In addition, the largest discovery set consisted of 30k genotyped animals. Limited or extensive phenotypic information was mimicked by changing the trait heritability. Significant and large-effect size SNPs were added to the 50k panel and used for single-step genomic best linear unbiased prediction (ssGBLUP).
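The discovery set sizes described above can be sketched as follows: count how many of the largest eigenvalues of G are needed to reach a given proportion of the total variance. This is only an illustrative sketch, not the authors' pipeline. The function names (`build_g`, `n_largest_eigenvalues`), the VanRaden-style construction of G, and the toy genotype dimensions are assumptions that do not come from the abstract.

```python
import numpy as np


def build_g(genotypes):
    """Build a genomic relationship matrix from 0/1/2 genotype codes.

    Assumption: a VanRaden-style G = ZZ' / (2 * sum(p * (1 - p))) is used;
    the abstract does not state how G was constructed.
    """
    m = np.asarray(genotypes, dtype=float)
    p = m.mean(axis=0) / 2.0          # allele frequencies estimated from the data
    z = m - 2.0 * p                   # center each SNP column
    return z @ z.T / (2.0 * np.sum(p * (1.0 - p)))


def n_largest_eigenvalues(g, proportion):
    """Number of largest eigenvalues of g explaining `proportion` of the variance."""
    eig = np.linalg.eigvalsh(g)[::-1]     # eigvalsh returns ascending order
    eig = np.clip(eig, 0.0, None)         # drop tiny negative values from rounding
    cumulative = np.cumsum(eig) / eig.sum()
    count = int(np.searchsorted(cumulative, proportion)) + 1
    return min(count, eig.size)


if __name__ == "__main__":
    # Toy data only: 200 animals and 1000 SNPs, far smaller than the study.
    rng = np.random.default_rng(1)
    genotypes = rng.integers(0, 3, size=(200, 1000))
    g = build_g(genotypes)
    for pct in (50, 60, 70, 80, 90, 95, 98, 99):
        print(f"EIG{pct}: {n_largest_eigenvalues(g, pct / 100)} animals")
```

Under this sketch, the value returned for a proportion of 0.98 would play the role of EIG98, that is, the number of genotyped animals placed in the discovery set for that scenario.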
Using a number of genotyped animals corresponding to at least EIG98 allowed the identification of QTN with the largest effect sizes when Ne was large. Populations with smaller Ne required more genotyped animals than the number corresponding to EIG98. Furthermore, including genotyped animals with a higher reliability (i.e., a higher trait heritability) improved the identification of the most informative QTN. Prediction accuracy was highest when the significant or the large-effect SNPs representing twice the number of simulated QTN were added to the 50k panel.
Accurately identifying causative variants from sequence data depends on the effective population size and, therefore, on the dimensionality of genomic information. This dimensionality can help identify the most suitable sample size for GWA and could be considered for variant selection, especially when resources are restricted. Even when variants are accurately identified, their inclusion in prediction models has limited benefits.