Matsumoto T, Sadakane K, Imai H
Department of Information Science, University of Tokyo,7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan.
Genome Inform Ser Workshop Genome Inform. 2000;11:43-52.
Today, more and more DNA sequences are becoming available. The information about DNA sequences are stored in molecular biology databases. The size and importance of these databases will be bigger and bigger in the future, therefore this information must be stored or communicated efficiently. Furthermore, sequence compression can be used to define similarities between biological sequences. The standard compression algorithms such as gzip or compress cannot compress DNA sequences, but only expand them in size. On the other hand, CTW (Context Tree Weighting Method) can compress DNA sequences less than two bits per symbol. These algorithms do not use special structures of biological sequences. Two characteristic structures of DNA sequences are known. One is called palindromes or reverse complements and the other structure is approximate repeats. Several specific algorithms for DNA sequences that use these structures can compress them less than two bits per symbol. In this paper, we improve the CTW so that characteristic structures of DNA sequences are available. Before encoding the next symbol, the algorithm searches an approximate repeat and palindrome using hash and dynamic programming. If there is a palindrome or an approximate repeat with enough length then our algorithm represents it with length and distance. By using this preprocessing, a new program achieves a little higher compression ratio than that of existing DNA-oriented compression algorithms. We also describe new compression algorithm for protein sequences.
如今,越来越多的DNA序列可供使用。有关DNA序列的信息存储在分子生物学数据库中。这些数据库的规模和重要性在未来将越来越大,因此必须高效地存储或传递这些信息。此外,序列压缩可用于定义生物序列之间的相似性。诸如gzip或compress之类的标准压缩算法无法压缩DNA序列,反而会使其大小增加。另一方面,上下文树加权法(CTW)可以将DNA序列压缩至每个符号不到两位。这些算法并未利用生物序列的特殊结构。已知DNA序列有两种特征结构。一种称为回文或反向互补序列,另一种结构是近似重复序列。几种利用这些结构的DNA序列特定算法可以将其压缩至每个符号不到两位。在本文中,我们改进了CTW,以便能够利用DNA序列的特征结构。在对下一个符号进行编码之前,该算法使用哈希和动态规划搜索近似重复序列和回文序列。如果存在足够长度的回文序列或近似重复序列,那么我们的算法会用长度和距离来表示它。通过这种预处理,一个新程序实现了比现有面向DNA的压缩算法略高的压缩率。我们还描述了一种新的蛋白质序列压缩算法。