Institute of Informatics, Silesian University of Technology, 44-100 Gliwice, Poland.
Bioinformatics. 2011 Nov 1;27(21):2979-86. doi: 10.1093/bioinformatics/btr505. Epub 2011 Sep 5.
Storing, transferring and maintaining genomic databases has become a major challenge because of rapid technological progress in DNA sequencing and the correspondingly growing pace at which sequencing data are produced. Efficient compression, with support for extraction of arbitrary snippets of any sequence, is the key to maintaining such huge amounts of data.
We present an LZ77-style compression scheme for relative compression of multiple genomes of the same species. While the solution bears similarity to known algorithms, it offers significantly higher compression ratios at a compression speed over an order of magnitude greater. In particular, 69 differentially encoded human genomes are compressed over 400 times in the fast compression mode, or even 1000 times in the slower mode (the reference genome itself needs much more space). Adding fast random access to text snippets decreases the ratio to ~300.
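The general idea of LZ77-style relative compression can be illustrated with a minimal sketch. This is not GDC's actual algorithm, whose details are not given in this abstract; it is a generic greedy scheme that encodes a target sequence as copies from a reference plus single-character literals. The function names, the token format, and the parameters `k` and `min_match` are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def relative_compress(target, reference, k=4, min_match=4):
    """Greedily encode target as matches into reference plus literals.

    Hypothetical token format (an assumption, not GDC's format):
      ("match", position_in_reference, length) or ("literal", character).
    """
    index = build_index(reference, k)
    tokens = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position sharing the next k-mer; keep the longest match.
        for pos in index.get(target[i:i + k], ()):
            length = 0
            while (i + length < len(target)
                   and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= min_match:
            tokens.append(("match", best_pos, best_len))
            i += best_len
        else:
            tokens.append(("literal", target[i]))
            i += 1
    return tokens


def relative_decompress(tokens, reference):
    """Rebuild the target from the tokens and the same reference."""
    out = []
    for token in tokens:
        if token[0] == "match":
            _, pos, length = token
            out.append(reference[pos:pos + length])
        else:
            out.append(token[1])
    return "".join(out)


# Toy example: the target differs from the reference by one substitution.
reference = "ACGTTAGCCGATAGGCTAAC"
target = "ACGTTAGCCTATAGGCTAAC"
tokens = relative_compress(target, reference)
restored = relative_decompress(tokens, reference)
```

In this toy case the 20-character target collapses to two matches and one literal, which hints at why genomes of the same species compress so well relative to a reference. A practical tool would also need a compact binary encoding of the tokens and an index to support random access, both of which are omitted here.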
GDC is available at http://sun.aei.polsl.pl/gdc.
Supplementary data are available at Bioinformatics online.