An Adaptive Difference Distribution-based Coding with Hierarchical Tree Structure for DNA Sequence Compression.

Authors

Dai Wenrui, Xiong Hongkai, Jiang Xiaoqian, Ohno-Machado Lucila

Affiliations

Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

Division of Biomedical Informatics, University of California, San Diego, San Diego, CA 92093, USA

Publication information

Proc Data Compress Conf. 2013;2013:371-380. doi: 10.1109/DCC.2013.45. Epub 2013 Mar 22.

DOI: 10.1109/DCC.2013.45

PMID: 26501129

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4617277/

Abstract

Previous reference-based compression methods for DNA sequences do not fully exploit the intrinsic statistics, because they consider only approximate matches. In this paper, an adaptive difference distribution-based coding framework is proposed that operates on fragments of nucleotides organized in a hierarchical tree structure. To keep the distribution of the difference sequence between the reference and target sequences concentrated, the sub-fragment size and the matching offset used for prediction adapt flexibly within a stepped size structure. Matching against approximate repeats in the reference uses a Hamming-like weighted distance measure in a local region close to the current fragment, so that matching accuracy and the overhead of describing the matching offset are balanced. A well-designed coding scheme compacts both the difference sequence and the additional parameters, e.g., sub-fragment size and matching offset. Experimental results show that the proposed scheme achieves a 150% compression improvement over GReEn, the best reference-based compressor.
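The abstract does not state the distance function, the split rule, or the entropy coder, so the following Python sketch is only an illustration of the general idea under stated assumptions, not the authors' method. It assumes the cost of a candidate match is the Hamming mismatch count plus a linear penalty on the offset, and that a fragment is halved whenever its difference list is too long. The names `best_offset`, `diffs_at`, `encode`, `decode` and the parameters `max_offset`, `offset_weight`, `min_size`, `max_diffs` are all hypothetical.

```python
from typing import List, Tuple

Diff = Tuple[int, str]                      # (index inside fragment, target base)
Segment = Tuple[int, int, int, List[Diff]]  # (start, size, offset, diffs)


def diffs_at(frag: str, ref: str, ref_start: int) -> List[Diff]:
    """Positions where the fragment differs from the reference window."""
    window = ref[ref_start:ref_start + len(frag)]
    return [(i, t) for i, (t, r) in enumerate(zip(frag, window)) if t != r]


def best_offset(frag: str, ref: str, pos: int,
                max_offset: int = 4, offset_weight: float = 0.25) -> int:
    """Search a local window around pos for the cheapest match.

    Assumed cost: mismatches + offset_weight * |offset| (not from the paper).
    """
    candidates = [d for d in range(-max_offset, max_offset + 1)
                  if pos + d >= 0 and pos + d + len(frag) <= len(ref)]
    if not candidates:
        raise ValueError("no reference window fits this fragment")
    return min(candidates,
               key=lambda d: len(diffs_at(frag, ref, pos + d))
               + offset_weight * abs(d))


def encode(target: str, ref: str, start: int, size: int,
           min_size: int = 4, max_diffs: int = 1) -> List[Segment]:
    """Recursively halve a fragment until its difference list is short."""
    frag = target[start:start + size]
    d = best_offset(frag, ref, start)
    diffs = diffs_at(frag, ref, start + d)
    if len(diffs) <= max_diffs or size <= min_size:
        return [(start, size, d, diffs)]      # leaf of the tree
    half = size // 2
    return (encode(target, ref, start, half, min_size, max_diffs)
            + encode(target, ref, start + half, size - half, min_size, max_diffs))


def decode(ref: str, segments: List[Segment]) -> str:
    """Rebuild the target from the reference plus offsets and differences."""
    out = []
    for start, size, d, diffs in segments:
        bases = list(ref[start + d:start + d + size])
        for i, base in diffs:
            bases[i] = base                   # patch each recorded mismatch
        out.append("".join(bases))
    return "".join(out)


# Round trip on a toy pair of equal-length sequences.
ref = "AAAACGTTGCAAAA"
target = "AACGTAGCAAAAAA"
segments = encode(target, ref, 0, len(target))
assert decode(ref, segments) == target
```

The offset penalty is what trades matching accuracy against the cost of describing the offset: a distant exact repeat is chosen only if it saves more mismatches than its offset costs. A real coder would then entropy-code the leaf sizes, offsets, and differences, which this sketch leaves out.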

Similar articles

Error Tree: A Tree Structure for Hamming and Edit Distances and Wildcards Matching.

J Comput Biol. 2015 Dec;22(12):1118-28. doi: 10.1089/cmb.2015.0132. Epub 2015 Sep 24.

CoGI: Towards Compressing Genomes as an Image.

IEEE/ACM Trans Comput Biol Bioinform. 2015 Nov-Dec;12(6):1275-85. doi: 10.1109/TCBB.2015.2430331.

Biological sequence compression algorithms.

Genome Inform Ser Workshop Genome Inform. 2000;11:43-52.

2D-pattern matching image and video compression: theory, algorithms, and experiments.

IEEE Trans Image Process. 2002;11(3):318-31. doi: 10.1109/83.988964.

Adaptive Quantization Parameter Cascading in HEVC Hierarchical Coding.

IEEE Trans Image Process. 2016 Jul;25(7):2997-3009. doi: 10.1109/TIP.2016.2556941. Epub 2016 Apr 20.

Video compression with binary tree recursive motion estimation and binary tree residue coding.

IEEE Trans Image Process. 2000;9(7):1288-92. doi: 10.1109/83.847841.

Compression of Multiple DNA Sequences Using Intra-Sequence and Inter-Sequence Similarities.

IEEE/ACM Trans Comput Biol Bioinform. 2015 Nov-Dec;12(6):1322-32. doi: 10.1109/TCBB.2015.2403370.

On the quality of tree-based protein classification.

Bioinformatics. 2005 May 1;21(9):1876-90. doi: 10.1093/bioinformatics/bti244. Epub 2005 Jan 12.

libFLASM: a software library for fixed-length approximate string matching.

BMC Bioinformatics. 2016 Nov 10;17(1):454. doi: 10.1186/s12859-016-1320-2.

Cited by

Comparison of Compression-Based Measures with Application to the Evolution of Primate Genomes.

Entropy (Basel). 2018 May 23;20(6):393. doi: 10.3390/e20060393.

Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review.

PLoS One. 2020 May 26;15(5):e0232942. doi: 10.1371/journal.pone.0232942. eCollection 2020.

References

GReEn: a tool for efficient compression of genome resequencing data.

Nucleic Acids Res. 2012 Feb;40(4):e27. doi: 10.1093/nar/gkr1124. Epub 2011 Dec 1.

On the future of genomic data.

Science. 2011 Feb 11;331(6018):728-9. doi: 10.1126/science.1197891.

A novel compression tool for efficient storage of genome resequencing data.

Nucleic Acids Res. 2011 Apr;39(7):e45. doi: 10.1093/nar/gkr009. Epub 2011 Jan 25.

Data structures and compression algorithms for genomic sequence data.

Bioinformatics. 2009 Jul 15;25(14):1731-8. doi: 10.1093/bioinformatics/btp319. Epub 2009 May 15.

Human genomes as email attachments.

Bioinformatics. 2009 Jan 15;25(2):274-5. doi: 10.1093/bioinformatics/btn582. Epub 2008 Nov 7.

DNACompress: fast and effective DNA sequence compression.

Bioinformatics. 2002 Dec;18(12):1696-8. doi: 10.1093/bioinformatics/18.12.1696.

Biological sequence compression algorithms.

Genome Inform Ser Workshop Genome Inform. 2000;11:43-52.

A compression algorithm for DNA sequences.

IEEE Eng Med Biol Mag. 2001 Jul-Aug;20(4):61-6. doi: 10.1109/51.940049.
