Fast parallelized sampling of Bayesian regression models for whole-genome prediction.

Author information

Department of Animal Science, University of California Davis, Davis, CA, 95616, USA.

Integrative Genetics and Genomics Graduate Group, University of California Davis, Davis, CA, 95616, USA.

Publication information

Genet Sel Evol. 2020 Mar 23;52(1):16. doi: 10.1186/s12711-020-00533-x.

DOI:10.1186/s12711-020-00533-x

PMID:32293243

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7087391/

Abstract

BACKGROUND

Bayesian regression models are widely used in genomic prediction, where the effects of all markers are estimated simultaneously by combining the information from the phenotypic data with priors for the marker effects and other parameters such as variance components or membership probabilities. Inferences from most Bayesian regression models are based on Markov chain Monte Carlo (MCMC) methods, where statistics are computed from a Markov chain constructed to have a stationary distribution that is equal to the posterior distribution of the unknown parameters. In practice, chains of tens of thousands of steps are typically used in whole-genome Bayesian analyses, which is computationally intensive.

METHODS

In this paper, we propose a fast parallelized algorithm for Bayesian regression models called independent intensive Bayesian regression models (BayesXII, "X" stands for Bayesian alphabet methods and "II" stands for "parallel") and show how the sampling of each marker effect can be made independent of samples for other marker effects within each step of the chain. This is done by augmenting the marker covariate matrix with p new rows, where p is the number of markers, such that the columns of the augmented marker covariate matrix are orthogonal. Ideally, the computations at each step of the MCMC chain can be accelerated by a factor of k, where k is the number of computer processors, up to a maximum factor of p.
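The augmentation step can be illustrated with a small numerical sketch. The abstract does not say how the p new rows are built, so the construction below is an assumption: it picks a scalar d slightly above the largest eigenvalue of X'X and takes the new rows from a Cholesky factor of dI - X'X. The function name `augment_orthogonal` and the `margin` parameter are hypothetical and used only for illustration.

```python
import numpy as np

def augment_orthogonal(X, margin=1.01):
    """Stack p extra rows under X so that the columns become orthogonal.

    Assumption (not taken from the abstract): the extra rows X_tilde satisfy
    X_tilde' X_tilde = d*I - X'X, so that X_aug' X_aug = d*I.
    """
    p = X.shape[1]
    gram = X.T @ X
    # d must exceed the largest eigenvalue of X'X so that d*I - X'X
    # is positive definite and has a Cholesky factor.
    d = margin * np.linalg.eigvalsh(gram)[-1]
    L = np.linalg.cholesky(d * np.eye(p) - gram)  # L @ L.T = d*I - X'X
    X_tilde = L.T                                  # X_tilde.T @ X_tilde = L @ L.T
    return np.vstack([X, X_tilde]), d

# Toy example: 20 individuals, 5 markers.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
X_aug, d = augment_orthogonal(X)

# Off-diagonal entries of X_aug' X_aug vanish up to rounding error.
gram_aug = X_aug.T @ X_aug
off_diagonal = gram_aug - np.diag(np.diag(gram_aug))
print(X_aug.shape, float(np.abs(off_diagonal).max()))
```

Because x_j'x_k = 0 for every pair of distinct columns of the augmented matrix, the cross-terms that couple marker effects in a single-site sampler drop out. That is the property that lets the marker effects be sampled in parallel within a step of the chain.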

RESULTS

We demonstrate the BayesXII algorithm using the prior for BayesCπ, a Bayesian variable selection regression method, which is applied to simulated data with 50,000 individuals and a medium-density marker panel (~50,000 markers). Reaching about the same accuracy as the conventional samplers for BayesCπ required less than 30 min using the BayesXII algorithm on 24 nodes (computers used as servers) with 24 cores on each node. In this case, the BayesXII algorithm required one tenth of the computation time of conventional samplers for BayesCπ. Addressing the heavy computational burden associated with Bayesian methods by parallel computing will lead to greater use of these methods.
