Department of Computer Engineering, Ege University, Izmir, Turkey.
Comput Biol Med. 2021 Sep;136:104656. doi: 10.1016/j.compbiomed.2021.104656. Epub 2021 Jul 17.
The string matching algorithms are among the essential fields in computer science, such as text search, intrusion detection systems, fraud detection, sequence search in bioinformatics. The exact string matching algorithms are divided into two parts: single and multiple. Multiple string matching algorithms involve finding elements of the pattern set P in a given input text T. String matching processes should be done in a time-efficient manner for DNA sequences. As the volume of the text T increases and the number of search patterns increases, the total runtime increases. Efficient algorithms should be selected to perform these search operations as soon as possible. In this study, the Wu-Manber algorithm, one of the multiple exact string matching algorithms, is improved. Although the Wu-Manber algorithm is effective, it has some limitations, such as hash collisions. In this study, the WM-q algorithm, a version of the Wu-Manber algorithm based on the perfect hash function for DNA sequences, is proposed. String matching is performed using different block lengths provided by the perfect hash function instead of using the fixed block length as in the traditional Wu-Manber algorithm. The proposed approach has been compared with E. Coli and Human Chromosome1 datasets, frequently used in the literature, using multiple exact string matching algorithms. The proposed algorithm gives better results for performance metrics such as the average runtime, the average number of characters and hash comparisons.
串匹配算法是计算机科学中的重要领域之一,例如文本搜索、入侵检测系统、欺诈检测、生物信息学中的序列搜索。确切的串匹配算法分为两类:单模式和多模式。多模式串匹配算法涉及在给定的输入文本 T 中查找模式集 P 的元素。对于 DNA 序列,串匹配过程应该以高效的方式进行。随着文本 T 的体积增加和搜索模式的数量增加,总运行时间会增加。应该选择有效的算法来尽快执行这些搜索操作。在本研究中,改进了多模式精确串匹配算法之一的 Wu-Manber 算法。虽然 Wu-Manber 算法是有效的,但它有一些局限性,例如哈希冲突。在本研究中,提出了基于完美哈希函数的 Wu-Manber 算法的 WM-q 版本,用于 DNA 序列。使用完美哈希函数提供的不同块长度而不是使用传统 Wu-Manber 算法中的固定块长度来执行串匹配。使用多模式精确串匹配算法,将所提出的方法与文献中常用的 E. Coli 和 Human Chromosome1 数据集进行了比较。对于平均运行时间、平均字符数和哈希比较等性能指标,所提出的算法给出了更好的结果。