An Adaptive Difference Distribution-based Coding with Hierarchical Tree Structure for DNA Sequence Compression.

Authors

Dai Wenrui, Xiong Hongkai, Jiang Xiaoqian, Ohno-Machado Lucila

Affiliations

Department of Electronic Engineering, Shanghai Jiaotong University, Shanghai 200240, China

Division of Biomedical Informatics, University of California, San Diego, San Diego, CA 92093, USA

Publication information

Proc Data Compress Conf. 2013;2013:371-380. doi: 10.1109/DCC.2013.45. Epub 2013 Mar 22.

Abstract

Previous reference-based methods for compressing DNA sequences do not fully exploit the intrinsic statistics of the data, because they consider only approximate matches. This paper proposes an adaptive difference distribution-based coding framework that operates on fragments of nucleotides organized in a hierarchical tree structure. To keep the distribution of the difference sequence between the reference and target sequences concentrated, the sub-fragment size and the matching offset used for prediction vary according to a stepped size structure. Approximate repeats in the reference are matched using a Hamming-like weighted distance measure within a local region close to the current fragment, so that matching accuracy is balanced against the overhead of describing the matching offset. A well-designed coding scheme compactly represents both the difference sequence and the additional parameters, e.g. sub-fragment size and matching offset. Experimental results show that the proposed scheme achieves a 150% compression improvement over GReEn, the best reference-based compressor.
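The local matching step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define the Hamming-like weighted distance, so the sketch assumes a plain Hamming distance plus a linear penalty on the offset magnitude. The function names, the `radius` and `offset_weight` parameters, and the `.` marker for matching bases are all hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_local_match(reference, fragment, position, radius, offset_weight=0.0):
    """Search offsets in [-radius, radius] around `position` in the reference.

    The cost is an assumed stand-in for the paper's weighted distance:
    Hamming distance plus offset_weight * |offset|, so that distant matches
    must be more accurate to justify a larger offset.
    Returns (offset, cost), or None if no candidate window fits.
    """
    best = None
    n = len(fragment)
    for offset in range(-radius, radius + 1):
        start = position + offset
        if start < 0 or start + n > len(reference):
            continue  # candidate window falls outside the reference
        cost = hamming(reference[start:start + n], fragment)
        cost += offset_weight * abs(offset)
        if best is None or cost < best[1]:
            best = (offset, cost)
    return best


def difference_sequence(reference, fragment, start):
    """Mark matching bases with '.' and keep the target base at mismatches."""
    window = reference[start:start + len(fragment)]
    return "".join("." if r == t else t for r, t in zip(window, fragment))


reference = "ACGTACGTTTGACCA"
fragment = "TTGACGA"   # target fragment, nominally aligned at position 6
offset, cost = best_local_match(reference, fragment, position=6,
                                radius=3, offset_weight=0.1)
diff = difference_sequence(reference, fragment, 6 + offset)
print(offset, diff)
```

In this toy case the best candidate lies two positions to the right of the nominal alignment, and the resulting difference sequence contains a single mismatch. A concentrated difference distribution of this kind (mostly "match" symbols) is what an entropy coder can represent compactly.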
