School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA.
Department of Computer Science, North Carolina State University, Raleigh, North Carolina, USA.
J Comput Biol. 2022 Dec;29(12):1377-1396. doi: 10.1089/cmb.2022.0411. Epub 2022 Nov 25.
The problem of aligning a sequence to a walk in a labeled graph is of fundamental importance to Computational Biology. For an arbitrary graph and a pattern of length , a lower bound based on the Strong Exponential Time Hypothesis implies that an algorithm for finding a walk in exactly matching significantly faster than time is unlikely. However, for many special graphs, such as de Bruijn graphs, the problem can be solved in linear time. For approximate matching, the picture is more complex. When edits (substitutions, insertions, and deletions) are only allowed to the pattern, or when the graph is acyclic, the problem is solvable in time. When edits are allowed to arbitrary cyclic graphs, the problem becomes NP-complete, even on binary alphabets. Moreover, NP-completeness continues to hold even when edits are restricted to only substitutions. Despite the popularity of the de Bruijn graphs in Computational Biology, the complexity of approximate pattern matching on the de Bruijn graphs remained unknown. We investigate this problem and show that the properties that make the de Bruijn graphs amenable to efficient exact pattern matching do not extend to approximate matching, even when restricted to the substitutions only case with alphabet size four. Specifically, we prove that determining the existence of a matching walk in a de Bruijn graph is NP-complete when substitutions are allowed to the graph. We also demonstrate that an algorithm significantly faster than is unlikely for the de Bruijn graphs in the case where substitutions are only allowed to the pattern. This stands in contrast to pattern-to-text matching where exact matching is solvable in linear time, such as on the de Bruijn graphs, but approximate matching under substitutions is solvable in subquadratic time, where is the text's length.
将序列与标记图中的路径对齐的问题对计算生物学具有重要意义。对于任意图 和长度为 的模式 ,基于强指数时间假设的下界意味着,找到与 精确匹配的 中的路径的算法不太可能比 时间快。然而,对于许多特殊图,例如 de Bruijn 图,该问题可以在线性时间内解决。对于近似匹配,情况更为复杂。当仅允许对模式进行编辑(替换、插入和删除)时,或者当图是无环时,该问题可以在 时间内解决。当编辑允许任意循环图时,该问题就会变得 NP 完全,即使在二进制字母表上也是如此。此外,即使编辑仅限于替换,NP 完全性仍然成立。尽管 de Bruijn 图在计算生物学中很受欢迎,但在 de Bruijn 图上进行近似模式匹配的复杂性仍然未知。我们研究了这个问题,并表明,使 de Bruijn 图易于进行高效精确模式匹配的特性并不能扩展到近似匹配,即使在仅允许替换的情况下也是如此,并且字母表大小为 4。具体来说,我们证明了当允许对图进行替换时,在 de Bruijn 图中确定是否存在匹配路径是 NP 完全的。我们还证明了在仅允许替换模式的情况下,对于 de Bruijn 图,不太可能存在比 更快的算法。这与模式到文本匹配形成对比,在模式到文本匹配中,精确匹配可以在线性时间内解决,例如在 de Bruijn 图上,但在允许替换的情况下,近似匹配可以在亚二次 时间内解决,其中 是文本的长度。