

KungFQ: a simple and powerful approach to compress fastq files.

Affiliation

Department of Genetics, Biology and Biochemistry, Molecular Biotechnology Center, via Nizza 52, Torino 10126, Italy.

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2012 Nov-Dec;9(6):1837-42. doi: 10.1109/TCBB.2012.123.

Abstract

Storing the data produced by deep sequencing experiments has become crucial, and standard compression algorithms do not exploit the structure of these data satisfactorily. A number of reference-based compression algorithms have been developed, but they are less suitable for new species without fully sequenced genomes or for nongenomic data. We developed a tool that takes advantage of the characteristics of fastq files and encodes them in a binary format optimized for further compression with standard tools (such as gzip or lzma). The algorithm is straightforward and needs no external reference file; it scans the fastq file only once and has a constant memory requirement. We also added the option of lossy compression, which discards some of the original information (IDs and/or qualities) in exchange for smaller files; it is also possible to define a quality cutoff below which the corresponding base calls are converted to N. We achieve compression ratios of 2.82 to 7.77 on various fastq files without losing information, and 5.37 to 8.77 when discarding IDs, which are often not used in common analysis pipelines. In this paper, we compare the algorithm's performance with that of known tools, usually obtaining higher compression levels.
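
The abstract does not specify KungFQ's binary layout, so the following Python sketch illustrates only the general ideas it names: a single pass over the fastq records, an optional quality cutoff that converts low-quality base calls to N, optional removal of IDs, and a hand-off to a standard compressor (lzma here). The function names, the tab-separated intermediate layout, the Phred+33 quality offset, and the strict four-line record structure are all assumptions made for illustration, not the tool's actual format or API.

```python
import lzma


def mask_low_quality(seq, qual, cutoff, offset=33):
    """Replace base calls whose Phred quality is below `cutoff` with N.

    Assumes Phred+33 quality encoding (hypothetical default).
    """
    return "".join(
        "N" if ord(q) - offset < cutoff else base
        for base, q in zip(seq, qual)
    )


def read_fastq(lines):
    """Yield (id, seq, qual) one record at a time.

    Assumes complete four-line records: @id, sequence, '+', qualities.
    """
    it = iter(lines)
    for header in it:
        seq = next(it).rstrip("\n")
        next(it)  # skip the '+' separator line
        qual = next(it).rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def compress_fastq(lines, cutoff=0, keep_ids=True):
    """Single pass over the records, feeding an incremental lzma compressor.

    Only one record is held at a time; the compressed output is collected
    in memory here purely to keep the example short.
    """
    comp = lzma.LZMACompressor()
    out = bytearray()
    for read_id, seq, qual in read_fastq(lines):
        seq = mask_low_quality(seq, qual, cutoff)
        fields = ([read_id] if keep_ids else []) + [seq, qual]
        out += comp.compress(("\t".join(fields) + "\n").encode("ascii"))
    out += comp.flush()
    return bytes(out)


def compression_ratio(original_size, compressed_size):
    """Ratio as used in the abstract: original size over compressed size."""
    return original_size / compressed_size


if __name__ == "__main__":
    record = ["@read1\n", "ACGTACGT\n", "+\n", "IIII##II\n"]
    lines = record * 500
    raw_size = len("".join(lines).encode("ascii"))
    blob = compress_fastq(lines, cutoff=10, keep_ids=False)
    print(mask_low_quality("ACGTACGT", "IIII##II", 10))  # ACGTNNGT
    print(compression_ratio(raw_size, len(blob)) > 1)
```

Masking low-quality calls to N and dropping IDs both reduce the variability of the stream handed to the general-purpose compressor, which is one plausible reason the lossy modes described above yield smaller files. The ratios this toy input produces say nothing about the figures reported in the abstract.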

