IEEE/ACM Trans Comput Biol Bioinform. 2023 Mar-Apr;20(2):1587-1593. doi: 10.1109/TCBB.2022.3208867.
Motif Searching is an important problem that can reveal crucial information from biological data. Since the general motif searching is NP-hard and the volume of biological data is growing exponentially in recent years, there is a pressing need for developing time and space-efficient algorithms to find motifs. In this paper, we explore scalable parallelization for Edit Distance-Based Motif Search (EMS). We introduce two parallel designs, recursEMS which integrates the existing EMS solver into a parallel recursion tree running in multiple processes, and parEMS that presents a novel thread-based method which avoids the storage of redundant motif candidates. To make the parallel designs practical, we implement SPEMS, a Scalability-sensitive Parallel solver for EMS. For any given biological dataset and search instance, SPEMS can provide an EMS parallelization towards the optimal performance, or a sub-optimal performance but being more space efficient. Evaluations on two real-world DNA dataset TRANSFAC and ChIP-seq show that SPEMS can obtain 10× geometric mean speedup over the state-of-the-art at the expense of no less than 74.7% memory overheads, or provide 2.2× geometric mean speedup with the possibility of consuming less memory, when running on a 48-core machine.
模体搜索是一个重要的问题,可以从生物数据中揭示关键信息。由于一般的模体搜索是 NP 难问题,并且近年来生物数据的数量呈指数级增长,因此迫切需要开发时间和空间效率高的算法来寻找模体。在本文中,我们探索了基于编辑距离的模体搜索(EMS)的可扩展并行化。我们引入了两种并行设计,recursEMS 将现有的 EMS 求解器集成到多个进程中运行的并行递归树中,parEMS 提出了一种新的基于线程的方法,避免了冗余模体候选的存储。为了使并行设计实用化,我们实现了 SPEMS,这是一种针对 EMS 的可扩展并行求解器。对于任何给定的生物数据集和搜索实例,SPEMS 都可以提供 EMS 并行化,以实现最佳性能,或者提供次优性能但更节省空间。在两个真实的 DNA 数据集 TRANSFAC 和 ChIP-seq 上的评估表明,SPEMS 可以在不超过 74.7%的内存开销的情况下,获得比最先进的方法快 10 倍的几何平均加速,或者在可能消耗更少内存的情况下提供 2.2 倍的几何平均加速,当在 48 核机器上运行时。