Is single-step genomic REML with the algorithm for proven and young more computationally efficient when less generations of data are present?

Affiliations

Breeding Research Department, Bayer Crop Science, Uberlândia, Minas Gerais, Brazil.

Departamento de Zootecnia, Universidade Federal de Viçosa, Viçosa, Minas Gerais, Brazil.

Publication information

J Anim Sci. 2022 May 1;100(5). doi: 10.1093/jas/skac082.

DOI:10.1093/jas/skac082

PMID:35289906

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9118993/

Abstract

Efficient computing techniques allow the estimation of variance components for virtually any traditional dataset. When genomic information is available, variance components can be estimated using genomic REML (GREML). If only a portion of the animals have genotypes, single-step GREML (ssGREML) is the method of choice. The genomic relationship matrix (G) used in both cases is dense, limiting computations depending on the number of genotyped animals. The algorithm for proven and young (APY) can be used to create a sparse inverse of G (G_APY⁻¹) with close to linear memory and computing requirements. In ssGREML, the inverse of the realized relationship matrix (H⁻¹) also includes the inverse of the pedigree relationship matrix, which can be dense with a long pedigree, but sparser with short. The main purpose of this study was to investigate whether costs of ssGREML can be reduced using APY with truncated pedigree and phenotypes. We also investigated the impact of truncation on variance components estimation when different numbers of core animals are used in APY. Simulations included 150K animals from 10 generations, with selection. Phenotypes (h² = 0.3) were available for all animals in generations 1-9. A total of 30K animals in generations 8 and 9, and 15K validation animals in generation 10 were genotyped for 52,890 SNP. Average information REML and ssGREML with G⁻¹ and G_APY⁻¹ using 1K, 5K, 9K, and 14K core animals were compared. Variance components are impacted when the core group in APY represents the number of eigenvalues explaining a small fraction of the total variation in G. The most time-consuming operation was the inversion of G, with more than 50% of the total time. Next, numerical factorization consumed nearly 30% of the total computing time. On average, a 7% decrease in the computing time for ordering was observed by removing each generation of data.
APY can be successfully applied to create the inverse of the genomic relationship matrix used in ssGREML for estimating variance components. To ensure reliable variance component estimation, it is important to use a core size that corresponds to the number of largest eigenvalues explaining around 98% of total variation in G. When APY is used, pedigrees can be truncated to increase the sparsity of H and slightly reduce computing time for ordering and symbolic factorization, with no impact on the estimates.
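
The two ideas in the abstract, choosing a core size from the eigenvalues of G and building the sparse APY inverse, can be illustrated with a minimal NumPy sketch. This is not the software used in the study. The function names `core_size_for_variance` and `apy_inverse`, the toy genotype data, the simplified scaling of G, and the random choice of core animals are all assumptions made for illustration. The block formula is the standard APY construction, in which only the core-by-core block is inverted directly and the non-core block of the inverse is diagonal. The result is stored as a dense array here only for readability; a practical implementation would keep the blocks in sparse form.

```python
import numpy as np


def core_size_for_variance(G, fraction=0.98):
    """Smallest number of largest eigenvalues of G whose sum reaches
    `fraction` of the total variation (the trace of G)."""
    eigvals = np.linalg.eigvalsh(G)[::-1]          # sorted descending
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(explained, fraction)) + 1
    return min(k, len(eigvals))


def apy_inverse(G, core_idx):
    """APY inverse of G for the given core animals (hypothetical helper).

    With c = core and n = non-core, P = Gcc^-1 Gcn and
    M = diag(g_jj - g_jc Gcc^-1 g_cj), the inverse has the blocks
        cc: Gcc^-1 + P M^-1 P'
        cn: -P M^-1
        nn: M^-1 (diagonal)
    """
    size = G.shape[0]
    core = np.asarray(core_idx)
    noncore = np.setdiff1d(np.arange(size), core)

    Gcc_inv = np.linalg.inv(G[np.ix_(core, core)])
    Gcn = G[np.ix_(core, noncore)]
    P = Gcc_inv @ Gcn                               # shape (c, n)

    # Diagonal elements of M, one per non-core animal.
    m = np.diag(G)[noncore] - np.einsum("ij,ij->j", Gcn, P)
    m_inv = 1.0 / m

    G_apy_inv = np.zeros((size, size))
    scaled = P * m_inv                              # P M^-1 (column scaling)
    G_apy_inv[np.ix_(core, core)] = Gcc_inv + scaled @ P.T
    G_apy_inv[np.ix_(core, noncore)] = -scaled
    G_apy_inv[np.ix_(noncore, core)] = -scaled.T
    G_apy_inv[noncore, noncore] = m_inv             # diagonal entries only
    return G_apy_inv


if __name__ == "__main__":
    # Toy data (assumption): 200 animals, 1000 markers coded 0/1/2.
    rng = np.random.default_rng(42)
    genotypes = rng.integers(0, 3, size=(200, 1000)).astype(float)
    centered = genotypes - genotypes.mean(axis=0)
    # Simplified genomic relationship matrix, blended for invertibility.
    G = centered @ centered.T / centered.shape[1] + 0.01 * np.eye(200)

    k = core_size_for_variance(G, fraction=0.98)
    core = rng.choice(200, size=k, replace=False)   # random core (assumption)
    G_apy_inv = apy_inverse(G, core)
    print("core size:", k, "inverse shape:", G_apy_inv.shape)
```

In this sketch the cost of a direct inversion is confined to the core-by-core block, so memory and computation grow roughly linearly with the number of non-core animals once the core size is fixed, which is the property the abstract attributes to APY.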
