Data-dependent bucketing improves reference-free compression of sequencing reads.

Authors

Patro Rob, Kingsford Carl

Affiliations

Department of Computer Science, Stony Brook University, Stony Brook, NY 11794-4400, USA.

Department of Computational Biology, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA.

Publication information

Bioinformatics. 2015 Sep 1;31(17):2770-7. doi: 10.1093/bioinformatics/btv248. Epub 2015 Apr 24.

Abstract

MOTIVATION

The storage and transmission of high-throughput sequencing data consumes significant resources. As our capacity to produce such data continues to increase, this burden will only grow. One approach to reduce storage and transmission requirements is to compress this sequencing data.

RESULTS

We present a novel technique to boost the compression of sequencing data that is based on the concept of bucketing similar reads so that they appear nearby in the file. We demonstrate that, by adopting a data-dependent bucketing scheme and employing a number of encoding ideas, we can achieve substantially better compression ratios than existing de novo sequence compression tools, including other bucketing and reordering schemes. Our method, Mince, achieves up to a 45% reduction in file sizes (28% on average) compared with existing state-of-the-art de novo compression schemes.
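
The general idea of bucketing similar reads so that they sit next to each other before compression can be illustrated with a minimal Python sketch. This is not Mince's algorithm: the bucket label used here (the lexicographically smallest k-mer of each read), the function names, the choice of k and the use of zlib as the downstream compressor are all assumptions made purely for illustration.

```python
import zlib
from collections import defaultdict


def min_kmer(read, k):
    """Return the lexicographically smallest k-mer of a read.

    Hypothetical bucket label for illustration only; assumes len(read) >= k.
    """
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_reads(reads, k=3):
    """Reorder reads so that reads sharing a bucket label are adjacent."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[min_kmer(read, k)].append(read)
    # Emit buckets in sorted label order; order within a bucket is preserved.
    return [read for label in sorted(buckets) for read in buckets[label]]


def compressed_size(reads):
    """Size in bytes of the newline-joined reads after zlib compression."""
    return len(zlib.compress("\n".join(reads).encode("ascii"), 9))


if __name__ == "__main__":
    reads = ["GATTACA", "CCCGGGT", "TTTACAG", "ACAGGTT"]
    reordered = bucket_reads(reads, k=3)
    print(reordered)
    print(compressed_size(reads), compressed_size(reordered))
```

Reordering discards the original read order, so a scheme of this kind only suits data where that order carries no information. Whether the reordered stream actually compresses better depends on the data and on the compressor; the sketch only shows where the reordering step would sit.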

AVAILABILITY AND IMPLEMENTATION

Mince is written in C++11, is open source and has been made available under the GPLv3 license. It is available at http://www.cs.cmu.edu/~ckingsf/software/mince.

CONTACT

carlk@cs.cmu.edu

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
