Robust relative compression of genomes with random access.

Affiliation

Institute of Informatics, Silesian University of Technology, 44-100 Gliwice, Poland.

Publication information

Bioinformatics. 2011 Nov 1;27(21):2979-86. doi: 10.1093/bioinformatics/btr505. Epub 2011 Sep 5.

Abstract

MOTIVATION

Storing, transferring and maintaining genomic databases has become a major challenge because of rapid technological progress in DNA sequencing and the correspondingly growing pace at which sequencing data are produced. Efficient compression, with support for extraction of arbitrary snippets of any sequence, is the key to maintaining those huge amounts of data.

RESULTS

We present an LZ77-style compression scheme for relative compression of multiple genomes of the same species. While the solution bears similarity to known algorithms, it offers significantly higher compression ratios at compression speed over an order of magnitude greater. In particular, 69 differentially encoded human genomes are compressed over 400 times at fast compression, or even 1000 times at slower compression (the reference genome itself needs much more space). Adding fast random access to text snippets decreases the ratio to ~300.
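
To make the idea of relative LZ77-style compression with random access concrete, here is a minimal Python sketch. It is not GDC's implementation: the seed length `K`, the token layout, and the function names `build_index`, `compress`, `decompress` and `extract` are all assumptions made for illustration. A target sequence is encoded as copies `(ref_pos, length)` from a reference plus literal strings, and a snippet is extracted without decoding the whole sequence.

```python
from typing import Dict, List, Tuple, Union

# A token is either a copy (ref_pos, length) from the reference or a literal string.
Token = Union[Tuple[int, int], str]

K = 4  # hypothetical seed length for this toy example; real tools use far longer seeds


def build_index(ref: str, k: int = K) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it occurs."""
    index: Dict[str, List[int]] = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], []).append(i)
    return index


def compress(target: str, ref: str, k: int = K) -> List[Token]:
    """Greedily encode target as longest matches against ref, with literals elsewhere."""
    index = build_index(ref, k)
    tokens: List[Token] = []
    literal: List[str] = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        for pos in index.get(target[i:i + k], []):
            length = k  # the seed already matches; extend it to the right
            while (i + length < len(target) and pos + length < len(ref)
                   and target[i + length] == ref[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            if literal:  # flush pending literal characters first
                tokens.append("".join(literal))
                literal = []
            tokens.append((best_pos, best_len))
            i += best_len
        else:
            literal.append(target[i])
            i += 1
    if literal:
        tokens.append("".join(literal))
    return tokens


def decompress(tokens: List[Token], ref: str) -> str:
    """Rebuild the full target from its tokens and the reference."""
    return "".join(ref[t[0]:t[0] + t[1]] if isinstance(t, tuple) else t
                   for t in tokens)


def extract(tokens: List[Token], ref: str, start: int, end: int) -> str:
    """Return target[start:end] by decoding only the tokens overlapping that range."""
    out: List[str] = []
    offset = 0  # position in the target where the current token begins
    for t in tokens:
        length = t[1] if isinstance(t, tuple) else len(t)
        lo, hi = max(start, offset), min(end, offset + length)
        if lo < hi:
            if isinstance(t, tuple):
                out.append(ref[t[0] + lo - offset:t[0] + hi - offset])
            else:
                out.append(t[lo - offset:hi - offset])
        offset += length
        if offset >= end:
            break
    return "".join(out)


ref = "ACGTACGTTTGACCA"
target = "ACGTACGTTAGACCA"  # differs from ref by a single substitution
tokens = compress(target, ref)
```

In this sketch `extract` scans the token list linearly. A practical scheme would presumably keep periodic offsets into the token stream so that a lookup can start near the requested position, and that extra bookkeeping is one plausible reason random access lowers the compression ratio.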

AVAILABILITY

GDC is available at http://sun.aei.polsl.pl/gdc.

CONTACT

sebastian.deorowicz@polsl.pl.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
