Wang Rongjie, Li Junyi, Bai Yang, Zang Tianyi, Wang Yadong
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China.
School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, China.
PeerJ. 2018 Oct 19;6:e5611. doi: 10.7717/peerj.5611. eCollection 2018.
Dramatic increases in the data produced by next-generation sequencing (NGS) technologies demand compression tools that save storage space. However, effective and efficient compression of genome sequencing data remains an unresolved challenge in NGS data studies. In this paper, we propose BdBG, a novel alignment-free and reference-free compression method, which is the first to compress genome sequencing data with dynamic de Bruijn graphs built on bucketed data. Compared with existing de Bruijn graph methods, BdBG stores only a list of bucket indexes and bifurcations for the raw read sequences, which effectively reduces storage space. Experimental results on several genome sequencing datasets show the effectiveness of BdBG over three state-of-the-art methods. BdBG is written in Python and is open-source software distributed under the MIT license, available for download at https://github.com/rongjiewang/BdBG.
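The idea of storing only bucket indexes and bifurcations can be illustrated with a minimal Python sketch. It is not BdBG's actual code or file format. It assumes that reads are bucketed by their lexicographically smallest m-mer (a minimizer; the abstract does not say how buckets are chosen), and the function names and the record layout `(first k-mer, read length, branches)` are hypothetical. Within a bucket, the graph grows as reads are processed, and a base is stored only where the graph built from earlier reads does not determine it.

```python
from collections import defaultdict

def minimizer(read, m):
    """Smallest m-mer of the read (assumed bucketing key; needs len(read) >= m)."""
    return min(read[i:i + m] for i in range(len(read) - m + 1))

def bucket_reads(reads, m):
    """Group reads by minimizer so that similar reads land in the same bucket."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, m)].append(read)
    return dict(buckets)

def add_to_graph(graph, read, k):
    """Record, for every k-mer of the read, the base that follows it."""
    for i in range(len(read) - k):
        graph[read[i:i + k]].add(read[i + k])

def encode_bucket(reads, k):
    """Encode each read as (first k-mer, length, branches).

    A branch (position, base) is stored only when the graph built from the
    earlier reads does not offer exactly one successor equal to that base.
    """
    graph = defaultdict(set)
    records = []
    for read in reads:
        branches = []
        for i in range(len(read) - k):
            node, nxt = read[i:i + k], read[i + k]
            if graph.get(node) != {nxt}:
                branches.append((i + k, nxt))
        records.append((read[:k], len(read), branches))
        add_to_graph(graph, read, k)  # graph is updated only after the read
    return records

def decode_bucket(records, k):
    """Rebuild the reads by replaying the same graph updates as the encoder."""
    graph = defaultdict(set)
    reads = []
    for start, length, branches in records:
        explicit = dict(branches)
        seq = start
        while len(seq) < length:
            pos = len(seq)
            if pos in explicit:
                seq += explicit[pos]
            else:
                (only,) = graph[seq[-k:]]  # exactly one successor is known
                seq += only
        reads.append(seq)
        add_to_graph(graph, seq, k)
    return reads

if __name__ == "__main__":
    reads = ["ACGGTCAATG", "ACGGTCAATG", "ACGGTCAATC"]
    for key, bucket in bucket_reads(reads, m=3).items():
        records = encode_bucket(bucket, k=4)
        assert decode_bucket(records, k=4) == bucket
        print(key, records)
```

In this toy input, the second read repeats the first and therefore needs no stored branches, while the third read needs a single branch for its final base. The first read of a bucket still pays for every base after its first k-mer, so any saving in this sketch depends on how similar the reads within a bucket are.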