Biological sequence compression algorithms.

Authors

Matsumoto T, Sadakane K, Imai H

Affiliation

Department of Information Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan.

Publication

Genome Inform Ser Workshop Genome Inform. 2000;11:43-52.

PMID:11700586

Abstract

Today, more and more DNA sequences are becoming available. Information about DNA sequences is stored in molecular biology databases. The size and importance of these databases will keep growing, so this information must be stored and communicated efficiently. Furthermore, sequence compression can be used to define similarities between biological sequences. Standard compression algorithms such as gzip or compress cannot compress DNA sequences; they only expand them in size. On the other hand, CTW (the Context Tree Weighting method) can compress DNA sequences to less than two bits per symbol. These algorithms do not use the special structures of biological sequences. Two characteristic structures of DNA sequences are known: palindromes, also called reverse complements, and approximate repeats. Several algorithms designed specifically for DNA sequences use these structures and can compress them to less than two bits per symbol. In this paper, we improve CTW so that it can exploit these characteristic structures. Before encoding the next symbol, the algorithm searches for an approximate repeat and a palindrome using hashing and dynamic programming. If there is a palindrome or an approximate repeat of sufficient length, our algorithm represents it by its length and distance. With this preprocessing, the new program achieves a slightly higher compression ratio than existing DNA-oriented compression algorithms. We also describe a new compression algorithm for protein sequences.
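The search step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it finds only exact repeats and exact reverse complements through a k-mer hash table, and it omits both the dynamic programming used for approximate matches and the CTW coding itself. The names `find_match`, `reverse_complement`, and the seed length `k` are hypothetical, chosen for illustration only.

```python
from collections import defaultdict

# Watson-Crick base pairing, used to build reverse complements.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return s.translate(COMPLEMENT)[::-1]


def find_match(seq, pos, k=4):
    """Find the longest exact repeat or reverse complement for seq[pos:].

    Returns (kind, length, distance), or None if no seed of length k matches.
    For a repeat, distance is pos minus the start of the earlier copy.
    For a palindrome, distance is pos minus the last base of the earlier
    segment, which is then read backwards and complemented.
    """
    n = len(seq)
    if pos + k > n:
        return None

    # Hash table of every k-mer that lies entirely before pos.
    index = defaultdict(list)
    for j in range(pos - k + 1):
        index[seq[j:j + k]].append(j)

    seed = seq[pos:pos + k]
    best = None

    # Direct repeats: extend the seed match forwards.
    for j in index.get(seed, []):
        length = k
        while pos + length < n and seq[j + length] == seq[pos + length]:
            length += 1
        if best is None or length > best[1]:
            best = ("repeat", length, pos - j)

    # Reverse complements: the earlier segment is extended backwards.
    for j in index.get(reverse_complement(seed), []):
        end = j + k - 1
        length = k
        while (pos + length < n and end - length >= 0
               and seq[pos + length] == seq[end - length].translate(COMPLEMENT)):
            length += 1
        if best is None or length > best[1]:
            best = ("palindrome", length, pos - end)

    return best


print(find_match("GATTACAGGATTACA", 8))  # ('repeat', 7, 8)
print(find_match("ACCGTATACGGT", 6))     # ('palindrome', 6, 1)
```

In this sketch the table is rebuilt on every call to keep the code short; an encoder scanning a whole sequence would more plausibly update it incrementally and fall back to symbol-by-symbol coding whenever the best match is shorter than some threshold.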


Similar articles

Biological sequence compression algorithms.

Genome Inform Ser Workshop Genome Inform. 2000;11:43-52.

A lossless compression algorithm for DNA sequences.

Int J Bioinform Res Appl. 2009;5(6):593-602. doi: 10.1504/IJBRA.2009.02904.

Modified HuffBit Compress Algorithm - An Application of R.

J Integr Bioinform. 2018 Feb 22;15(3):20170057. doi: 10.1515/jib-2017-0057.

SeqCompress: an algorithm for biological sequence compression.

Genomics. 2014 Oct;104(4):225-8. doi: 10.1016/j.ygeno.2014.08.007. Epub 2014 Aug 27.

DNA sequence compression using the Burrows-Wheeler transform.

Proc IEEE Comput Soc Bioinform Conf. 2002;1:303-13.

DNABIT Compress - Genome compression algorithm.

Bioinformation. 2011 Jan 22;5(8):350-60. doi: 10.6026/97320630005350.

Discovering simple DNA sequences by compression.

Pac Symp Biocomput. 1998:597-608.

GATA: a graphic alignment tool for comparative sequence analysis.

BMC Bioinformatics. 2005 Jan 17;6:9. doi: 10.1186/1471-2105-6-9.

T-REKS: identification of Tandem REpeats in sequences with a K-meanS based algorithm.

Bioinformatics. 2009 Oct 15;25(20):2632-8. doi: 10.1093/bioinformatics/btp482. Epub 2009 Aug 11.

Iterative dictionary construction for compression of large DNA data sets.

IEEE/ACM Trans Comput Biol Bioinform. 2012 Jan-Feb;9(1):137-49. doi: 10.1109/TCBB.2011.82. Epub 2011 Apr 27.

Cited by

CHAPAO: Likelihood and hierarchical reference-based representation of biomolecular sequences and applications to compressing multiple sequence alignments.

PLoS One. 2022 Apr 18;17(4):e0265360. doi: 10.1371/journal.pone.0265360. eCollection 2022.

Comparison of Compression-Based Measures with Application to the Evolution of Primate Genomes.

Entropy (Basel). 2018 May 23;20(6):393. doi: 10.3390/e20060393.

A compression method for DNA.

PLoS One. 2020 Nov 25;15(11):e0238220. doi: 10.1371/journal.pone.0238220. eCollection 2020.

Efficient DNA sequence compression with neural networks.

Gigascience. 2020 Nov 11;9(11). doi: 10.1093/gigascience/giaa119.

Tackling the Challenges of FASTQ Referential Compression.

Bioinform Biol Insights. 2019 Feb 14;13:1177932218821373. doi: 10.1177/1177932218821373. eCollection 2019.

Converting DNA and chemical fingerprints into two-dimensional barcode.

J Ginseng Res. 2017 Jul;41(3):339-346. doi: 10.1016/j.jgr.2016.06.006. Epub 2016 Jul 21.

An Optimal Seed Based Compression Algorithm for DNA Sequences.

Adv Bioinformatics. 2016;2016:3528406. doi: 10.1155/2016/3528406. Epub 2016 Jul 31.

An Adaptive Difference Distribution-based Coding with Hierarchical Tree Structure for DNA Sequence Compression.

Proc Data Compress Conf. 2013;2013:371-380. doi: 10.1109/DCC.2013.45. Epub 2013 Mar 22.

An alignment-free method to find and visualise rearrangements between pairs of DNA sequences.

Sci Rep. 2015 May 18;5:10203. doi: 10.1038/srep10203.

Reference-based compression of short-read sequences using path encoding.

Bioinformatics. 2015 Jun 15;31(12):1920-8. doi: 10.1093/bioinformatics/btv071. Epub 2015 Feb 2.
