Prunella N, Liuni S, Attimonelli M, Pesole G
Digital Equipment S.p.A., Bari, Italy.
Comput Appl Biosci. 1993 Oct;9(5):541-5. doi: 10.1093/bioinformatics/9.5.541.
A new string searching algorithm is presented aimed at searching for the occurrence of character patterns in longer character texts. The algorithm, specifically designed for nucleic acid sequence data, is essentially derived from the Boyer-Moore method (Comm. ACM, 20, 762-772, 1977). Both pattern and text data are compressed so that the natural 4-letter alphabet of nucleic acid sequences is considerably enlarged. The string search starts from the last character of the pattern and proceeds in large jumps through the text to be searched. The data compression and searching algorithm allows one to avoid searching for patterns not present in the text as well as to inspect, for each pattern, all text characters until the exact match with the text is found. These considerations are supported by empirical evidence and comparisons with other methods.
提出了一种新的字符串搜索算法,旨在搜索较长字符文本中字符模式的出现位置。该算法是专门为核酸序列数据设计的,本质上源自博耶 - 摩尔方法(《美国计算机协会通讯》,20,762 - 772,1977)。模式和文本数据都经过压缩,从而使核酸序列自然的4字母字母表得到大幅扩充。字符串搜索从模式的最后一个字符开始,通过大幅跳跃在待搜索文本中进行。数据压缩和搜索算法使人们能够避免搜索文本中不存在的模式,并且对于每个模式,能够检查所有文本字符,直到找到与文本的精确匹配。这些考虑得到了经验证据以及与其他方法比较结果的支持。