LCQS: an efficient lossless compression tool of quality scores with random access functionality.

Affiliations

School of Computer Science & Engineering, South China University of Technology, Wushan Road, Guangzhou, 510006, China.

Communication & Computer Network Lab of Guangdong, South China University of Technology, Wushan Road, Guangzhou, 510006, China.

Publication information

BMC Bioinformatics. 2020 Mar 18;21(1):109. doi: 10.1186/s12859-020-3428-7.

Abstract

BACKGROUND

Advanced sequencing machines have dramatically accelerated the generation of genomic data, making efficient compression of sequencing data an urgent need. Quality scores are the hardest part of the standard sequencing data format FASTQ to compress, and they have become the main obstacle in the development of FASTQ compressors. Existing lossless quality score compressors mainly rely on patterns specific to particular sequencers and on complex context modeling techniques to improve the compression ratio. Their main drawbacks are weak robustness, meaning unstable results or outright failure on some sequencing files, and slow compression speed. Some compressors build a fine-grained index structure to speed up random access decompression, but they do so at the cost of compression speed and large index files, which makes them inefficient and impractical. An efficient lossless quality score compressor that combines strong robustness, a high compression ratio, fast compression, and fast random access decompression is therefore urgently needed.
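For readers unfamiliar with the data being compressed, the sketch below pulls the quality score line out of each four-line FASTQ record and decodes it to integer Phred scores. It assumes the common Phred+33 ASCII encoding and well-formed, unwrapped records; the function name `quality_scores` is hypothetical and not part of LCQS.

```python
def quality_scores(fastq_text, offset=33):
    """Return one list of integer Phred scores per FASTQ record.

    Assumes four lines per record (header, bases, '+', qualities)
    and Phred+33 encoding; both are assumptions of this sketch.
    """
    lines = fastq_text.strip().split("\n")
    # The quality line is the 4th line of every 4-line record.
    return [[ord(ch) - offset for ch in lines[i]]
            for i in range(3, len(lines), 4)]


record = "@read1\nACGT\n+\nII5#\n"
scores = quality_scores(record)
```

Each quality character carries one score per base, which is why this stream is as long as the sequence itself and is the target of dedicated compressors.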

RESULTS

Based on the idea of maximizing the use of hardware resources, this paper proposes LCQS, a lossless compression tool specialized for quality scores. It consists of four sequential processing steps: partitioning, indexing, packing, and parallelizing. Experimental results show that LCQS outperforms all the other state-of-the-art compressors on all criteria, except for compression speed on the dataset SRR1284073. LCQS is also robust across all the test datasets, with compression speedups of up to 29.1x, file size reductions of up to 28.78%, and random access decompression speedups of up to 2.1x. In addition, LCQS scales well: its compression speed increases almost linearly as the size of the input dataset increases.
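The abstract does not describe the four steps in detail, so the following is only a minimal sketch of the general idea of block-based compression with an index for random access, not the LCQS algorithm itself. It assumes fixed-size blocks of quality lines, uses `zlib` as a stand-in for the packing step, and uses a thread pool for the parallel step; the names `compress_blocks` and `read_line` are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_blocks(quality_lines, block_size=4, workers=2):
    """Partition lines into blocks, compress each one, and build an index.

    zlib is a placeholder compressor chosen for this sketch only.
    """
    # Partitioning: group lines into independent fixed-size blocks.
    blocks = [quality_lines[i:i + block_size]
              for i in range(0, len(quality_lines), block_size)]
    raw = ["\n".join(b).encode("ascii") for b in blocks]
    # Parallelizing: blocks are independent, so compress them concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        packed = list(pool.map(zlib.compress, raw))
    # Indexing: record the (offset, length) of every compressed block.
    index, offset = [], 0
    for p in packed:
        index.append((offset, len(p)))
        offset += len(p)
    return b"".join(packed), index


def read_line(payload, index, line_no, block_size=4):
    """Random access: decompress only the block that holds line_no."""
    block_no, within = divmod(line_no, block_size)
    offset, length = index[block_no]
    block = zlib.decompress(payload[offset:offset + length])
    return block.decode("ascii").split("\n")[within]


lines = ["IIII", "FFF#", "5555", "IIFF", "##II"]
payload, index = compress_blocks(lines, block_size=2)
third = read_line(payload, index, 2, block_size=2)
```

The block size sets the trade-off the abstract alludes to: smaller blocks make random access cheaper but enlarge the index and usually lower the compression ratio.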

CONCLUSION

LCQS handles all the different kinds of quality scores and offers a superior compression ratio and compression speed together with fast random access decompression, making it a highly efficient and advanced lossless quality score compressor. LCQS can be downloaded from https://github.com/SCUT-CCNL/LCQS and is freely available for non-commercial use.

