Modified HuffBit Compress Algorithm - An Application of R

Author information

Habib Nahida, Ahmed Kawsar, Jabin Iffat, Rahman Mohammad Motiur

Affiliations

Department of Computer Science and Engineering (CSE), Mawlana Bhashani Science and Technology University (MBSTU), Santosh, Tangail 1902, Bangladesh.

Department of Information and Communication Technology (ICT), Mawlana Bhashani Science and Technology University (MBSTU), Tangail, Bangladesh.

Publication information

J Integr Bioinform. 2018 Feb 22;15(3):20170057. doi: 10.1515/jib-2017-0057.

DOI:10.1515/jib-2017-0057

PMID:29470175

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6340127/

Abstract

Databases of genomic sequences are growing at an explosive rate because of the growing number of living organisms. Compressing deoxyribonucleic acid (DNA) sequences is an important task as these databases approach their capacity limits. Various algorithms have been developed for DNA sequence compression. One efficient algorithm, "HuffBit Compress", works on both repetitive and non-repetitive sequences and is based on the concept of the extended binary tree. This paper proposes and develops a modified version of "HuffBit Compress", implemented in the R language, to compress and decompress DNA sequences. The modified version always achieves the best-case compression ratio, at the cost of 6 extra bits compared with the best case of the original "HuffBit Compress" algorithm; it is named the "Modified HuffBit Compress Algorithm". The algorithm builds an extended binary tree based on Huffman codes and the most frequently occurring bases (A, C, G, T). In experiments on 6 sequences, the proposed algorithm gives approximately 16.18 % improvement in compression ratio over the "HuffBit Compress" algorithm and 11.12 % improvement over the "2-Bits Encoding Method".
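
The idea of ranking bases by frequency and giving shorter code words to the more frequent ones can be illustrated with a small sketch. It is written in Python rather than the R used in the paper, and it is not the authors' implementation. The abstract does not list the code words or say how the 6 extra bits are spent, so two things here are assumptions: the code words `0`, `10`, `110`, `111` of a skewed extended binary tree, and a 6-bit header holding the three most frequent bases as hypothetical 2-bit identifiers. The names `rank_bases`, `compress` and `decompress` are illustrative only.

```python
from collections import Counter

BASES = "ACGT"
# Assumed code words of a skewed extended binary tree, shortest first.
CODEWORDS = ["0", "10", "110", "111"]
# Hypothetical 2-bit identifiers, used only to write the 6-bit header.
BASE_ID = {"A": "00", "C": "01", "G": "10", "T": "11"}
ID_BASE = {bits: base for base, bits in BASE_ID.items()}


def rank_bases(seq):
    """Return the four bases ordered from most to least frequent."""
    counts = Counter(seq)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(BASES, key=lambda b: (-counts[b], b))


def compress(seq):
    """Encode a string over A, C, G, T as a string of '0'/'1' characters."""
    order = rank_bases(seq)
    table = dict(zip(order, CODEWORDS))
    # Header: the three most frequent bases, 2 bits each (3 x 2 = 6 bits).
    # The fourth base is implied, so it does not need to be stored.
    header = "".join(BASE_ID[b] for b in order[:3])
    return header + "".join(table[b] for b in seq)


def decompress(bits):
    """Invert compress(): read the header, then decode the prefix codes."""
    order = [ID_BASE[bits[i:i + 2]] for i in (0, 2, 4)]
    order.append(next(b for b in BASES if b not in order))
    decode = dict(zip(CODEWORDS, order))
    out, word = [], ""
    for bit in bits[6:]:
        word += bit
        if word in decode:  # code words are prefix-free, so this is unambiguous
            out.append(decode[word])
            word = ""
    return "".join(out)


seq = "T" * 10 + "G" * 4 + "C" + "A"
encoded = compress(seq)
assert decompress(encoded) == seq
print(len(encoded), "bits vs", 2 * len(seq), "bits for 2-bit encoding")
```

For this 16-base example the sketch emits 30 bits (6 header bits plus 24 payload bits) against 32 bits for a fixed 2-bit encoding. Under the assumed code words the gain depends on a skewed base distribution: with all four bases equally frequent, the average code length is 2.25 bits per base, which is worse than 2-bit encoding.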
