Baele Guy, Lemey Philippe, Rambaut Andrew, Suchard Marc A
Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium.
Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, UK.
Bioinformatics. 2017 Jun 15;33(12):1798-1805. doi: 10.1093/bioinformatics/btx088.
Advances in sequencing technology continue to deliver increasingly large molecular sequence datasets that are often heavily partitioned in order to accurately model the underlying evolutionary processes. In phylogenetic analyses, partitioning strategies involve estimating conditionally independent models of molecular evolution for different genes and for different positions within those genes. This requires estimating a large number of evolutionary parameters and increases the computational burden of such analyses. The past two decades have also seen the rise of multi-core processors, in both the central processing unit (CPU) and graphics processing unit (GPU) markets, enabling massively parallel computations that many software packages for multipartite analyses do not yet fully exploit.
Here we propose a Markov chain Monte Carlo (MCMC) approach that uses an adaptive multivariate transition kernel to estimate, in parallel, a large number of parameters split across partitioned data by exploiting multi-core processing. Across several real-world examples, we demonstrate that our approach estimates these multipartite parameters more efficiently than standard approaches, which typically use a mixture of univariate transition kernels. In one case, when estimating the relative rate parameter of the non-coding partition in a heterochronous dataset, MCMC integration efficiency improves more than 14-fold.
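To make the idea of an adaptive multivariate transition kernel concrete, the following is a minimal sketch of a generic adaptive Metropolis sampler, in which a multivariate normal proposal is tuned to the running covariance of the chain so that correlated parameters are updated jointly. It is an illustration under stated assumptions, not the BEAST implementation: the function names (`cholesky`, `adaptive_metropolis`, `log_target`), the toy two-parameter Gaussian target, the burn-in length, and the conventional 2.38^2/d scaling are all choices made for this example, and the multi-core evaluation of partition likelihoods described above is not shown.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a small symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def adaptive_metropolis(log_density, start, n_iter, burn_in=200, eps=1e-6, seed=1):
    """Generic adaptive Metropolis sketch (hypothetical helper, not BEAST code)."""
    rng = random.Random(seed)
    d = len(start)
    scale = 2.38 ** 2 / d              # conventional scaling, assumed here
    x, lp = list(start), log_density(start)
    n, mean = 1, list(start)           # running count and mean of visited states
    m2 = [[0.0] * d for _ in range(d)]  # running sum of outer-product deviations
    samples, accepted = [], 0
    for _ in range(n_iter):
        if n <= burn_in:
            # Fixed isotropic proposal until enough states have been collected.
            prop_cov = [[0.1 * (i == j) for j in range(d)] for i in range(d)]
        else:
            # Proposal covariance adapted to the empirical covariance so far.
            prop_cov = [[scale * (m2[i][j] / (n - 1) + eps * (i == j))
                         for j in range(d)] for i in range(d)]
        low = cholesky(prop_cov)
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Joint proposal for all parameters: y = x + L z.
        y = [x[i] + sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(d)]
        lp_y = log_density(y)
        # Symmetric proposal, so the acceptance ratio is the density ratio.
        if math.log(1.0 - rng.random()) < lp_y - lp:
            x, lp = y, lp_y
            accepted += 1
        samples.append(list(x))
        # Welford-style update of the running mean and covariance sums.
        n += 1
        delta = [x[i] - mean[i] for i in range(d)]
        mean = [mean[i] + delta[i] / n for i in range(d)]
        for i in range(d):
            for j in range(d):
                m2[i][j] += delta[i] * (x[j] - mean[j])
    return samples, accepted / n_iter


def log_target(v, rho=0.9):
    """Toy target: two strongly correlated standard normal parameters."""
    a, b = v
    return -0.5 * (a * a - 2.0 * rho * a * b + b * b) / (1.0 - rho * rho)


samples, acceptance_rate = adaptive_metropolis(log_target, [1.0, -1.0], 5000)
```

In this toy setting, a univariate kernel would have to take small steps along each axis of the correlated target, whereas the adapted multivariate proposal learns the correlation and moves along it; this is the intuition behind the efficiency gain, not a reproduction of the reported results.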
Our implementation is part of the BEAST code base, a widely used open-source software package for Bayesian phylogenetic inference.
Supplementary data are available at Bioinformatics online.