DNA序列中的压缩模式匹配

Compressed pattern matching in DNA sequences.

作者信息

Chen Lei, Lu Shiyong, Ram Jeffrey

机构信息

Wayne State University, USA.

出版信息

Proc IEEE Comput Syst Bioinform Conf. 2004:62-8. doi: 10.1109/csb.2004.1332418.

DOI:10.1109/csb.2004.1332418

PMID:16448000

Abstract

We propose derivative Boyer-Moore (d-BM), a new compressed pattern matching algorithm in DNA sequences. This algorithm is based on the Boyer-Moore method, which is one of the most popular string matching algorithms. In this approach, we compress both DNA sequences and patterns by using two bits to represent each A, T, C, G character. Experiments indicate that this compressed pattern matching algorithm searches long DNA patterns (length > 50) more than 10 times faster than the exact match routine of the software package Agrep, which is known as the fastest pattern matching tool. Moreover, compression of DNA sequences by this method gives a guaranteed space saving of 75%. In part the enhanced speed of the algorithm is due to the increased efficiency of the Boyer-Moore method resulting from an increase in alphabet size from 4 to 256.

摘要

我们提出了派生的博耶 - 摩尔算法（d - BM），这是一种用于DNA序列的新型压缩模式匹配算法。该算法基于博耶 - 摩尔方法，它是最流行的字符串匹配算法之一。在这种方法中，我们通过使用两位来表示每个A、T、C、G字符，对DNA序列和模式都进行了压缩。实验表明，这种压缩模式匹配算法搜索长DNA模式（长度> 50）的速度比软件包Agrep的精确匹配程序快10倍以上，Agrep是已知最快的模式匹配工具。此外，用这种方法压缩DNA序列可保证节省75%的空间。该算法速度提高的部分原因是博耶 - 摩尔方法的效率提高，这是由于字母表大小从4增加到256所致。