Rajarajeswari Pothuraju, Apparao Allam
Bioinformation. 2011 Jan 22;5(8):350-60. doi: 10.6026/97320630005350.
Data compression is concerned with how information is organized in data. Efficient storage means removing redundancy from the data stored in the DNA molecule. Data compression algorithms remove redundancy and are also used to understand biologically important molecules. We present "DNABIT Compress", a compression algorithm for DNA sequences based on a novel scheme that assigns binary bits to small segments of DNA bases, compressing both repetitive and non-repetitive DNA sequences. The proposed algorithm achieves the best compression ratio for DNA sequences of larger genomes. Its significantly better compression results show that DNABIT Compress outperforms the other compression algorithms compared. While achieving the best compression ratios for DNA sequences (genomes), DNABIT Compress also significantly improves on the running time of all previous DNA compression programs. Assigning binary bits (a unique bit code) to exact-repeat and reverse-repeat fragments of a DNA sequence is a concept introduced for the first time in DNA compression by this algorithm. The proposed algorithm achieves a compression ratio as low as 1.58 bits/base, whereas the best existing methods could not achieve a ratio below 1.72 bits/base.
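To make the bits/base figures concrete, the following minimal sketch shows the fixed 2-bit-per-base encoding that any DNA compressor must beat, along with how a bits/base ratio is computed. It is an illustration only and is not the DNABIT Compress algorithm: the code table and the function names (`pack_2bit`, `unpack_2bit`, `bits_per_base`) are assumptions introduced here, and the abstract does not specify the bit codes the authors assign to segments or repeats.

```python
# Illustrative baseline only: a fixed 2-bit code per base.
# This table is an assumption, NOT the bit codes used by DNABIT Compress.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_2bit(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte, first base in the high bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so decoding can read from the high bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_2bit(data: bytes, n_bases: int) -> str:
    """Invert pack_2bit; n_bases is needed to drop the padding in the last byte."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:n_bases])


def bits_per_base(compressed_bits: int, n_bases: int) -> float:
    """Compression ratio in the units used by the abstract (lower is better)."""
    return compressed_bits / n_bases


seq = "ACGTACGTTTGA"
packed = pack_2bit(seq)
assert unpack_2bit(packed, len(seq)) == seq
print(bits_per_base(2 * len(seq), len(seq)))  # the fixed code costs 2.0 bits/base
```

Against this 2.0 bits/base baseline, the reported 1.58 bits/base corresponds to a saving of 21% in encoded size, and the 1.72 bits/base quoted for prior methods to a saving of 14%.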