Smith R
Public Health Research Institute, New York, NY 10016.
Comput Appl Biosci. 1988 Nov;4(4):459-65. doi: 10.1093/bioinformatics/4.4.459.
Existing algorithms for finding restriction endonuclease recognition sites use brute-force algorithms which run in time 0(NM) where N is the number of nucleotides in the sequence under analysis and M is the total number of nucleotides in all the different sites being searched for. This paper presents a deterministic finite state machine algorithm which runs in time 0(N). Memory use can be as high as 0(M4) but a slight modification to the basic algorithm can impose a theoretical upper bound of 0(M) at the cost of some added complexity in the execution of the state machine. The algorithm can operate with a single pass through the sequence under analysis, with no need to back up or (for non-circular sequences) store more than a single input character at a time. This type of algorithm can be adapted to many pattern-matching tasks and is simple enough to implement in hardware that it could, for example, be built into a disk controller as part of a specialized database machine.
现有的用于寻找限制性内切酶识别位点的算法使用暴力算法,其运行时间为O(NM),其中N是被分析序列中的核苷酸数量,M是正在搜索的所有不同位点中的核苷酸总数。本文提出了一种确定性有限状态机算法,其运行时间为O(N)。内存使用量可能高达O(M4),但对基本算法进行轻微修改可以将理论上限设为O(M),代价是状态机执行时会增加一些复杂度。该算法可以对被分析序列进行一次遍历操作,无需回溯(对于非环状序列),并且每次只需要存储一个输入字符。这种类型的算法可以适用于许多模式匹配任务,并且足够简单,可以在硬件中实现,例如,可以作为专用数据库机器的一部分内置到磁盘控制器中。