Dai Wenrui, Xiong Hongkai, Jiang Xiaoqian, Ohno-Machado Lucila
Department of Electronic Engineering, Shanghai Jiaotong University, Shanghai 200240, China
Division of Biomedical Informatics, University of California, San Diego, San Diego, CA 92093, USA
Proc Data Compress Conf. 2013;2013:371-380. doi: 10.1109/DCC.2013.45. Epub 2013 Mar 22.
Previous reference-based compression methods for DNA sequences do not fully exploit the intrinsic statistics of the data, because they consider only approximate matches. This paper proposes an adaptive difference distribution-based coding framework that partitions nucleotides into fragments organized in a hierarchical tree structure. To keep the distribution of the difference sequence between the reference and target sequences concentrated, the sub-fragment size and the matching offset used for prediction adapt flexibly within a stepped size structure. Approximate repeats in the reference are matched using a Hamming-like weighted distance measure within a local region close to the current fragment, which balances matching accuracy against the overhead of describing the matching offset. A well-designed coding scheme compactly represents both the difference sequence and the additional parameters, e.g. sub-fragment size and matching offset. Experimental results show that the proposed scheme achieves a 150% compression improvement over GReEn, the best reference-based compressor.
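The local matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the uniform weights, the linear offset penalty, and the modulo-4 difference mapping are all assumptions made for illustration, since the abstract does not specify them. The sketch searches a small window of the reference around the current position, scores each candidate with a weighted Hamming distance plus a cost for the offset, and then forms a difference sequence against the chosen match.

```python
from typing import List, Tuple

# Hypothetical symbol-to-integer mapping for the four nucleotides.
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def weighted_hamming(a: str, b: str, weights: List[float]) -> float:
    """Sum the weights of the positions where a and b differ."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def best_local_match(reference: str, fragment: str, position: int,
                     radius: int, offset_penalty: float) -> Tuple[int, float]:
    """Try every offset in [-radius, radius] around `position` and return
    the (offset, cost) pair with the lowest cost.

    cost = weighted Hamming distance + offset_penalty * |offset|
    The penalty term is an assumed stand-in for the overhead of
    describing the matching offset.
    """
    n = len(fragment)
    weights = [1.0] * n  # assumption: uniform weights
    best_offset, best_cost = None, float("inf")
    for offset in range(-radius, radius + 1):
        start = position + offset
        if start < 0 or start + n > len(reference):
            continue  # candidate window falls outside the reference
        candidate = reference[start:start + n]
        cost = weighted_hamming(candidate, fragment, weights)
        cost += offset_penalty * abs(offset)
        if cost < best_cost:
            best_offset, best_cost = offset, cost
    if best_offset is None:
        raise ValueError("no candidate window fits inside the reference")
    return best_offset, best_cost


def difference_sequence(reference: str, fragment: str, start: int) -> List[int]:
    """Per-symbol difference modulo 4 between the fragment and the
    reference window beginning at `start`; 0 means the symbols match."""
    window = reference[start:start + len(fragment)]
    return [(BASE_INDEX[t] - BASE_INDEX[r]) % 4 for r, t in zip(window, fragment)]


if __name__ == "__main__":
    ref = "ACGTACGTTTGACCA"
    offset, cost = best_local_match(ref, "TTGA", position=6, radius=3,
                                    offset_penalty=0.1)
    print(offset, cost)
    print(difference_sequence(ref, "TTCA", 6 + offset))
```

When the match is good, the difference sequence is dominated by zeros, which is the kind of concentrated distribution an entropy coder can represent compactly. Raising `offset_penalty` or shrinking `radius` trades matching accuracy for a cheaper offset description.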