Roy Indranil, Aluru Srinivas
IEEE/ACM Trans Comput Biol Bioinform. 2016 Jan-Feb;13(1):99-111. doi: 10.1109/TCBB.2015.2430313.
Finding approximately conserved sequences, called motifs, across multiple DNA or protein sequences is an important problem in computational biology. In this paper, we consider the (l, d) motif search problem of identifying one or more motifs of length l present in at least q of the n given sequences, with each occurrence differing from the motif in at most d substitutions. The problem is known to be NP-complete, and the largest solved instance reported to date is (26,11). We propose a novel algorithm for the (l,d) motif search problem using streaming execution over a large set of non-deterministic finite automata (NFA). This solution is designed to take advantage of the micron automata processor, a new technology close to deployment that can simultaneously execute multiple NFA in parallel. We demonstrate the capability for solving much larger instances of the (l, d) motif search problem using the resources available within a single automata processor board, by estimating run-times for problem instances (39,18) and (40,17). The paper serves as a useful guide to solving problems using this new accelerator technology.
在多条DNA或蛋白质序列中寻找近似保守的序列(即基序)是计算生物学中的一个重要问题。在本文中,我们考虑(l, d)基序搜索问题,即在n条给定序列中识别至少q条序列中存在的长度为l的一个或多个基序,且每次出现与基序的差异最多为d个替换。已知该问题是NP完全问题,迄今为止报道的最大已解决实例是(26,11)。我们提出了一种用于(l, d)基序搜索问题的新颖算法,该算法通过在大量非确定性有限自动机(NFA)上进行流式执行。此解决方案旨在利用微米自动机处理器,这是一种即将部署的新技术,它可以同时并行执行多个NFA。通过估计问题实例(39,18)和(40,17)的运行时间,我们展示了使用单个自动机处理器板内可用资源解决更大规模(l, d)基序搜索问题实例的能力。本文为使用这种新的加速器技术解决问题提供了有用的指导。