Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, Ghent (Zwijnaarde), B-9052, Belgium.
BMC Bioinformatics. 2020 Mar 11;21(Suppl 2):81. doi: 10.1186/s12859-020-3348-6.
Identifying all matches of a large set of position weight matrices (PWMs) in long DNA sequences requires significant computational resources, and a number of efficient yet complex algorithms have been proposed for this task.
We propose BLAMM, a simple and efficient tool inspired by high performance computing techniques. The workload is expressed in terms of matrix-matrix products that are evaluated with high efficiency using optimized BLAS library implementations. The algorithm is easy to parallelize and implement on CPUs and GPUs and has a runtime that is independent of the selected p-value. In terms of single-core performance, it is competitive with state-of-the-art software for PWM matching while being much more efficient when using multithreading. Additionally, BLAMM requires negligible memory. For example, both strands of the entire human genome can be scanned for 1404 PWMs in the JASPAR database in 13 min with a p-value of 10⁻⁴ using a 36-core machine. On a dual GPU system, the same task can be performed in under 5 min.
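To make the matrix-matrix formulation concrete, the following is a minimal sketch of the general idea, not BLAMM's actual implementation. It assumes that all PWMs share one length m (shorter ones could be zero-padded), that each PWM is flattened into a row of 4·m scores, and that every sequence window is one-hot encoded into a column of 4·m entries. The product of the two matrices then holds the score of every PWM at every position. The names `Matrix`, `encodeWindows`, `multiply`, `findMatches`, and `demoMatches` are hypothetical, and the plain triple loop stands in for the optimized BLAS `sgemm` routine that a real tool would call.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Row-major dense float matrix (hypothetical helper type, not taken from BLAMM).
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    float at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Map a nucleotide to 0..3 (A, C, G, T), or -1 for anything else such as 'N'.
int baseIndex(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

// Build the (4*m) x (n-m+1) one-hot matrix; column i encodes window seq[i, i+m).
Matrix encodeWindows(const std::string& seq, std::size_t m) {
    assert(m > 0 && seq.size() >= m);
    Matrix S(4 * m, seq.size() - m + 1);
    for (std::size_t i = 0; i < S.cols; ++i)
        for (std::size_t j = 0; j < m; ++j) {
            int b = baseIndex(seq[i + j]);
            if (b >= 0) S.at(4 * j + static_cast<std::size_t>(b), i) = 1.0f;
        }
    return S;
}

// Plain triple-loop product R = P * S; a stand-in for an optimized BLAS sgemm.
Matrix multiply(const Matrix& P, const Matrix& S) {
    assert(P.cols == S.rows);
    Matrix R(P.rows, S.cols);
    for (std::size_t i = 0; i < P.rows; ++i)
        for (std::size_t k = 0; k < P.cols; ++k)
            for (std::size_t j = 0; j < S.cols; ++j)
                R.at(i, j) += P.at(i, k) * S.at(k, j);
    return R;
}

struct Hit { std::size_t pwm, pos; float score; };

// Score every PWM (one row of P) at every position and keep scores >= threshold.
std::vector<Hit> findMatches(const Matrix& P, const std::vector<float>& thresholds,
                             const std::string& seq, std::size_t m) {
    assert(P.cols == 4 * m && thresholds.size() == P.rows);
    Matrix R = multiply(P, encodeWindows(seq, m));
    std::vector<Hit> hits;
    for (std::size_t i = 0; i < R.rows; ++i)
        for (std::size_t j = 0; j < R.cols; ++j)
            if (R.at(i, j) >= thresholds[i]) hits.push_back({i, j, R.at(i, j)});
    return hits;
}

// Toy example: two length-2 PWMs that favour "AC" and "GT" (+1 match, -1 otherwise).
std::vector<Hit> demoMatches() {
    const std::size_t m = 2;
    Matrix P(2, 4 * m);
    for (float& v : P.data) v = -1.0f;
    P.at(0, 4 * 0 + 0) = 1.0f;  // PWM 0, position 0, base A
    P.at(0, 4 * 1 + 1) = 1.0f;  // PWM 0, position 1, base C
    P.at(1, 4 * 0 + 2) = 1.0f;  // PWM 1, position 0, base G
    P.at(1, 4 * 1 + 3) = 1.0f;  // PWM 1, position 1, base T
    return findMatches(P, {1.5f, 1.5f}, "ACGTAC", m);
}
```

Calling `demoMatches()` scans the toy sequence `ACGTAC` and reports PWM 0 at positions 0 and 4 and PWM 1 at position 2. In this formulation the product is computed in full regardless of the score thresholds, which is consistent with a runtime that does not depend on the chosen p-value; only the final filtering step uses the thresholds.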
BLAMM is an efficient tool for identifying PWM matches in large DNA sequences. Its C++ source code is available under the GNU General Public License Version 3 at https://github.com/biointec/blamm.