Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA.
BMC Bioinformatics. 2013 Jan 19;14:25. doi: 10.1186/1471-2105-14-25.
Maximum Likelihood (ML)-based phylogenetic inference using Felsenstein's pruning algorithm is a standard method for estimating the evolutionary relationships among a set of species from DNA sequence data, and is used in popular applications such as RAxML, PHYLIP, GARLI, BEAST, and MrBayes. The Phylogenetic Likelihood Function (PLF) and its associated scaling and normalization steps comprise the computational kernel of these tools. These computations are data-intensive but contain fine-grained parallelism that can be exploited by coprocessor architectures such as FPGAs and GPUs. A general-purpose API called BEAGLE has recently been developed that includes optimized implementations of Felsenstein's pruning algorithm for various data-parallel architectures. In this paper, we extend the BEAGLE API to a multiple Field Programmable Gate Array (FPGA)-based platform called the Convey HC-1.
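To make the computational kernel concrete, the following is a minimal NumPy sketch of the per-node step of Felsenstein's pruning algorithm, the heart of the PLF. The function name, array shapes, and the toy transition matrix are illustrative assumptions for exposition; they are not taken from the paper's implementation or from the BEAGLE API.

```python
import numpy as np

STATES = 4  # nucleotide states A, C, G, T

def conditional_likelihoods(P_left, P_right, L_left, L_right):
    """One pruning step: combine the conditional likelihoods of two children.

    P_left, P_right : (STATES, STATES) branch transition-probability matrices.
    L_left, L_right : (n_sites, STATES) conditional likelihoods at the children.
    Returns the (n_sites, STATES) conditional likelihoods at the parent node:
        L_parent[s, x] = (sum_y P_left[x, y]  * L_left[s, y])
                       * (sum_z P_right[x, z] * L_right[s, z])
    """
    return (L_left @ P_left.T) * (L_right @ P_right.T)

# Tiny usage example: one alignment site, two tips observed as A and G,
# with a Jukes-Cantor-like transition matrix (hypothetical numbers).
P = np.full((STATES, STATES), 0.05) + np.eye(STATES) * 0.80
tips = np.eye(STATES)[[0, 2]]                    # one-hot tip likelihoods
root = conditional_likelihoods(P, P, tips[:1], tips[1:])
site_likelihood = root @ np.full(STATES, 0.25)   # uniform base frequencies
```

In a real inference tool this step is applied at every internal node of the tree and at every alignment site, which is the source of the fine-grained, data-parallel structure the abstract refers to.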
The core calculation of our implementation, which includes both the phylogenetic likelihood function (PLF) and the tree likelihood calculation, has an arithmetic intensity of 130 floating-point operations per 64 bytes of I/O, or 2.03 ops/byte. Its performance can thus be estimated as a function of the host platform's peak memory bandwidth and the implementation's memory efficiency: 2.03 × peak bandwidth × memory efficiency. Our FPGA-based platform has a peak bandwidth of 76.8 GB/s and our implementation achieves a memory efficiency of approximately 50%, giving an average throughput of 78 Gflops. This represents a speedup of roughly 40X over BEAGLE's CPU implementation on a dual Xeon 5520 and, for very large data sizes, 3X over BEAGLE's GPU implementation on a Tesla T10 GPU. The power consumption is 92 W, yielding a power efficiency of 1.7 Gflops per Watt.
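The roofline-style estimate above can be reproduced with a few lines of arithmetic; all constants below are the figures stated in the abstract.

```python
# Throughput = arithmetic intensity x peak memory bandwidth x memory efficiency
flops_per_block = 130        # floating-point ops per 64-byte I/O block
bytes_per_block = 64
intensity = flops_per_block / bytes_per_block   # ~2.03 ops/byte

peak_bandwidth_gbs = 76.8    # Convey HC-1 peak memory bandwidth, GB/s
memory_efficiency = 0.50     # ~50% measured memory efficiency

throughput_gflops = intensity * peak_bandwidth_gbs * memory_efficiency
print(round(throughput_gflops, 1))  # 78.0
```

Because the kernel is bandwidth-bound at this intensity, any gain in memory efficiency translates directly into sustained Gflops, which motivates the memory-efficiency-first design methodology described below.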
The use of data-parallel architectures to achieve high performance for likelihood-based phylogenetic inference requires high memory bandwidth and a design methodology that emphasizes high memory efficiency. To achieve this objective, we integrated 32 pipelined processing elements (PEs) across four FPGAs. For the design of each PE, we developed a specialized synthesis tool that generates a floating-point pipeline under resource and throughput constraints matched to the target platform. We found that using low-latency floating-point operators can significantly reduce FPGA area while still meeting timing requirements on the target platform, and that this design methodology can achieve performance exceeding that of a GPU-based coprocessor.