探索正则表达式匹配中的高效分组算法。

Exploring efficient grouping algorithms in regular expression matching.

机构信息

College of Computer, National University of Defense Technology, Changsha, Hunan, China.

National Key Laboratory for Parallel and Distributed Processing, Changsha, Hunan, China.

出版信息

PLoS One. 2018 Oct 24;13(10):e0206068. doi: 10.1371/journal.pone.0206068. eCollection 2018.

DOI:10.1371/journal.pone.0206068

PMID:30356262

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC6200238/

Abstract

BACKGROUND

Regular expression matching (REM) is widely employed as the major tool for deep packet inspection (DPI) applications. For automatic processing, the regular expression patterns need to be converted to a deterministic finite automata (DFA). However, with the ever-increasing scale and complexity of pattern sets, state explosion problem has brought a great challenge to the DFA based regular expression matching. Rule grouping is a direct method to solve the state explosion problem. The original rule set is divided into multiple disjoint groups, and each group is compiled to a separate DFA, thus to significantly restrain the severe state explosion problem when compiling all the rules to a single DFA.

OBJECTIVE

For practical implementation, the total number of DFA states should be as few as possible, thus the data structures of these DFAs can be deployed on fast on-chip memories for rapid access. In addition, to support fast pattern update in some applications, the time cost for grouping should be as small as possible. In this study, we aimed to propose an efficient grouping method, which generates as few states as possible with as little time overhead as possible.

METHODS

When compiling multiple patterns into a single DFA, the number of DFA states is usually greater than the total number of states when compiling each pattern to a separate DFA. This is mainly caused by the semantic overlaps among different rules. By quantifying the interaction values for each pair of rules, the rule grouping problem can be reduced to the maximum k-cut graph partitioning problem. Then, we propose a heuristic algorithm called the one-step greedy (OSG) algorithm to solve this NP-hard problem. What's more, a subroutine named the heuristic initialization (HI) algorithm is devised to further optimize the grouping algorithms.

RESULTS

We employed three practical rule sets for the experimental evaluation. Results show that the OSG algorithm outperforms the state-of-the-art grouping solutions regarding both the total number of DFA states and time cost for grouping. The HI subroutine also demonstrates its significant optimization effect on the grouping algorithms.

CONCLUSIONS

The DFA state explosion problem has became the most challenging issue in the regular expression matching applications. Rule grouping is a practical direction by dividing the original rule sets into multiple disjoint groups. In this paper, we investigate the current grouping solutions, and propose a compact and efficient grouping algorithm. Experiments conducted on practical rule sets demonstrate the superiority of our proposal.

摘要

背景

正则表达式匹配（REM）被广泛应用于深度包检测（DPI）应用中。为了实现自动处理，需要将正则表达式模式转换为确定有限自动机（DFA）。然而，随着模式集规模和复杂度的不断增加，状态爆炸问题给基于 DFA 的正则表达式匹配带来了巨大的挑战。规则分组是解决状态爆炸问题的直接方法。将原始规则集划分为多个不相交的组，然后将每个组编译成一个单独的 DFA，从而显著抑制了将所有规则编译成单个 DFA 时的严重状态爆炸问题。

目的

对于实际实现，DFA 状态的总数应尽可能少，以便可以将这些 DFA 的数据结构部署在快速的片上内存中以实现快速访问。此外，为了支持某些应用中的快速模式更新，分组所需的时间开销应尽可能小。在这项研究中，我们旨在提出一种高效的分组方法，该方法可以用尽可能少的时间开销生成尽可能少的状态。

方法

在将多个模式编译成单个 DFA 时，DFA 状态的数量通常大于将每个模式编译成单独的 DFA 时的总状态数量。这主要是由于不同规则之间的语义重叠造成的。通过量化每个规则对之间的交互值，可以将规则分组问题简化为最大 k-割图分区问题。然后，我们提出了一种称为一步贪婪（OSG）算法的启发式算法来解决这个 NP 难问题。此外，还设计了一个名为启发式初始化（HI）算法的子程序来进一步优化分组算法。