Meuwissen Theo, van den Berg Irene, Goddard Mike
Norwegian University of Life Sciences, Box 5003, 1432, Ås, Norway.
Agriculture Victoria, Bundoora, Australia.
Genet Sel Evol. 2021 Feb 26;53(1):19. doi: 10.1186/s12711-021-00607-4.
Whole-genome sequence (WGS) data are increasingly available on large numbers of individuals in animal and plant breeding and in human genetics through second-generation resequencing technologies, 1000 genomes projects, and large-scale genotype imputation from lower marker densities. Here, we present a computationally fast implementation of a variable selection genomic prediction method that can handle WGS data on more than 35,000 individuals, test its accuracy for across-breed predictions, and assess its quantitative trait locus (QTL) mapping precision.
The Markov chain Monte Carlo (MCMC) variable selection model (Bayes GC) simultaneously fits a genomic best linear unbiased prediction (GBLUP) term, i.e. a polygenic effect whose correlations are described by a genomic relationship matrix (G), and a Bayes C term, i.e. a set of single nucleotide polymorphisms (SNPs) with large effects selected by the model. Computational speed is improved by a Metropolis-Hastings sampling scheme that directs computations to the SNPs that are, a priori, most likely to be included in the model. Speed is also improved by running many relatively short MCMC chains. Memory requirements are reduced by storing the genotype matrix in binary form. The model was tested on a WGS dataset containing Holstein, Jersey and Australian Red cattle. The data contained 4,809,520 genotypes on 35,549 individuals together with their milk, fat and protein yields, and fat and protein percentage traits.
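To illustrate how binary storage of a genotype matrix can reduce memory, the following is a minimal sketch in Python. It assumes a 2-bit encoding (0, 1, 2 copies of an allele, with 3 reserved for missing), which is a common convention but is not stated in the abstract. The function names `pack_genotypes` and `unpack_genotype` are hypothetical and are not taken from the authors' software.

```python
def pack_genotypes(codes):
    """Pack genotype codes into 2 bits each, four genotypes per byte.

    Assumed encoding (illustrative only): 0, 1, 2 = allele count, 3 = missing.
    """
    packed = bytearray((len(codes) + 3) // 4)  # round up to whole bytes
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError(f"genotype code out of range: {code}")
        # Genotype i lives in byte i // 4, at bit offset 2 * (i % 4).
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_genotype(packed, i):
    """Read back the genotype code stored at index i."""
    return (packed[i // 4] >> (2 * (i % 4))) & 3


if __name__ == "__main__":
    codes = [0, 1, 2, 2, 1, 0, 3]
    packed = pack_genotypes(codes)
    print(len(codes), "genotypes stored in", len(packed), "bytes")
    print([unpack_genotype(packed, i) for i in range(len(codes))])
```

With this layout, four genotypes share one byte, so storage is a quarter of what a one-byte-per-genotype array would need.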
The prediction accuracies for the Jersey individuals improved by 1.5% when using across-breed GBLUP compared to within-breed predictions. Using WGS instead of 600k SNP-chip data yielded on average a 3% accuracy improvement for Australian Red cows. QTL were fine-mapped by locating the SNP with the highest posterior probability of being included in the model. Various QTL known from the literature were rediscovered, and a new SNP affecting milk production was discovered on chromosome 20 at 34.501126 Mb. Due to the high mapping precision, it was clear that many of the discovered QTL were the same across the five dairy traits.
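The fine-mapping step described above, choosing the SNP with the highest posterior probability of inclusion, can be sketched as follows. This is an illustrative sketch only: it assumes the MCMC output is available as a list of 0/1 inclusion indicators per sampled iteration, and the function names, toy samples and positions are hypothetical rather than taken from the study.

```python
def posterior_inclusion_probabilities(indicator_samples):
    """Estimate each SNP's posterior inclusion probability as the fraction
    of MCMC samples in which its inclusion indicator equals 1."""
    n_samples = len(indicator_samples)
    n_snps = len(indicator_samples[0])
    return [
        sum(sample[j] for sample in indicator_samples) / n_samples
        for j in range(n_snps)
    ]


def top_snp(pips, positions):
    """Return (position, probability) of the SNP with the highest probability."""
    best = max(range(len(pips)), key=pips.__getitem__)
    return positions[best], pips[best]


if __name__ == "__main__":
    # Toy data: 4 MCMC samples (rows) for 3 SNPs (columns), not real results.
    samples = [
        [0, 1, 0],
        [0, 1, 1],
        [0, 1, 0],
        [1, 1, 0],
    ]
    positions_bp = [100, 200, 300]  # hypothetical map positions
    pips = posterior_inclusion_probabilities(samples)
    print(top_snp(pips, positions_bp))
```

In practice the search would presumably be restricted to a genomic window around a QTL signal, and samples from several chains would be pooled after burn-in; those details are assumptions here.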
Across-breed Bayes GC genomic prediction improved prediction accuracies compared to GBLUP. The combination of across-breed WGS data and Bayesian genomic prediction proved remarkably effective for the fine-mapping of QTL.