LCQS: an efficient lossless compression tool of quality scores with random access functionality.

Affiliations

School of Computer Science & Engineering, South China University of Technology, Wushan Road, Guangzhou, 510006, China.

Communication & Computer Network Lab of Guangdong, South China University of Technology, Wushan Road, Guangzhou, 510006, China.

Publication information

BMC Bioinformatics. 2020 Mar 18;21(1):109. doi: 10.1186/s12859-020-3428-7.

DOI:10.1186/s12859-020-3428-7
PMID:32183707
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7079445/
Abstract

BACKGROUND

Advanced sequencing machines have dramatically sped up the generation of genomic data, making efficient compression of sequencing data an urgent need. Quality scores are the most difficult part of the standard sequencing data format FASTQ to compress, and their compression has become a conundrum in the development of FASTQ compression. Existing lossless compressors of quality scores mainly rely on patterns specific to a particular sequencer and on complex context modeling techniques to raise the compression ratio. Their main drawbacks are weak robustness, meaning unstable or even unavailable results on some sequencing files, and slow compression speed. Some compressors build a fine-grained index structure to speed up random access decompression, but they do so at the cost of compression speed and large index files, which makes them inefficient and impractical. An efficient lossless compressor of quality scores that combines strong robustness, a high compression ratio, fast compression, and fast random access decompression is therefore urgently needed.

RESULTS

In this paper, LCQS, a lossless compression tool specialized for quality scores, is proposed, based on the idea of maximizing the use of hardware resources. It consists of four sequential processing steps: partitioning, indexing, packing and parallelizing. Experimental results show that LCQS outperforms all the other state-of-the-art compressors on all criteria except for compression speed on the dataset SRR1284073. LCQS also shows strong robustness on all the test datasets, with compression speed-ups of up to 29.1x, file size reductions of up to 28.78%, and random access decompression speed-ups of up to 2.1x. In addition, LCQS exhibits strong scalability: the compression speed increases almost linearly as the size of the input dataset increases.
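The abstract names the four steps but does not describe them, so the following Python sketch only illustrates the general idea of block-wise compression with an offset index. It is not the LCQS algorithm: `zlib` stands in for the actual entropy coder, and the function names, the `BLOCK_LINES` block size, and the in-memory index layout are all assumptions made for illustration.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_LINES = 4  # hypothetical block size: quality-score lines per block


def compress_quality_scores(lines, block_lines=BLOCK_LINES, workers=4):
    """Partition quality lines into blocks, compress the blocks in parallel,
    and pack them into one payload with an (offset, length) index."""
    # Partitioning: fixed-size groups of quality-score lines.
    blocks = [lines[i:i + block_lines] for i in range(0, len(lines), block_lines)]

    # Parallelizing: every block is compressed independently of the others.
    # zlib is a stand-in here, not the coder used by LCQS.
    def compress_block(block):
        return zlib.compress("\n".join(block).encode("ascii"), 9)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        compressed = list(pool.map(compress_block, blocks))

    # Indexing and packing: concatenate the blocks and remember where each starts.
    index, payload, offset = [], bytearray(), 0
    for chunk in compressed:
        index.append((offset, len(chunk)))
        payload += chunk
        offset += len(chunk)
    return bytes(payload), index


def read_quality_line(payload, index, line_no, block_lines=BLOCK_LINES):
    """Random access: decompress only the block that contains line_no."""
    offset, length = index[line_no // block_lines]
    block = zlib.decompress(payload[offset:offset + length]).decode("ascii")
    return block.split("\n")[line_no % block_lines]


# Usage with made-up quality strings.
scores = ["IIIIHHHG", "FFFFFFFF", "IIIIIIII", "#####AAA", "GGGGFFFF"]
payload, index = compress_quality_scores(scores, block_lines=2)
print(read_quality_line(payload, index, 3, block_lines=2))  # prints "#####AAA"
```

Because each block is compressed on its own, a single record can be recovered by decompressing one block rather than the whole file. The trade-off the background section describes shows up in the block size: smaller blocks give finer-grained access but a larger index and usually a worse compression ratio.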

CONCLUSION

The ability to handle all different kinds of quality scores, together with its superior compression ratio and compression speed and its fast random access decompression, makes LCQS an efficient and advanced lossless quality score compressor. LCQS can be downloaded from https://github.com/SCUT-CCNL/LCQS and is freely available for non-commercial use.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a149/7079445/1578722bd15d/12859_2020_3428_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a149/7079445/ebb2419793c6/12859_2020_3428_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a149/7079445/7c5d3a0bc105/12859_2020_3428_Fig2_HTML.jpg
