
CMIC: an efficient quality score compressor with random access functionality.

Affiliation

School of Information, Yunnan University, Chenggong Campus, Kunming, Yunnan, China.

Publication information

BMC Bioinformatics. 2022 Jul 23;23(1):294. doi: 10.1186/s12859-022-04837-1.

DOI:10.1186/s12859-022-04837-1
PMID:35870880
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9308261/
Abstract

BACKGROUND

Over the past few decades, the emergence and maturation of new technologies have substantially reduced the cost of genome sequencing. As a result, the amount of genomic data that must be stored and transmitted has grown exponentially. In the standard sequencing data format, FASTQ, the quality scores are a key and difficult part of file compression. Most current quality score compression methods described in the literature do not support random access. This motivates the investigation of a lossless quality score compressor that offers a high compression rate, fast compression and decompression, and support for random access.

RESULTS

In this paper, we propose CMIC, an adaptive compressor with random access support for the lossless compression of quality score sequences. The name CMIC is an acronym of the four parts of its framework: classification, mapping, indexing, and compression. The experimental results show that CMIC achieves good compression rates on all tested datasets, reducing file sizes by up to 21.91% compared with LCQS. In compression speed, CMIC outperforms all other compressors in most of the tested cases. In random access speed, CMIC is faster than LCQS, which also provides random access to compressed quality scores.
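The abstract does not describe how the indexing step enables random access. As a rough illustration of the general idea only, the following sketch compresses quality score lines in independent blocks and records each block's byte offset, so that a single line can be recovered by decompressing one block. This is not the CMIC algorithm: `zlib` stands in for the paper's classification, mapping, and compression stages, and the names `compress_blocks`, `read_line`, and `BLOCK_LINES` are hypothetical.

```python
import zlib

BLOCK_LINES = 4  # hypothetical block size; not a value taken from the paper


def compress_blocks(quality_lines, block_lines=BLOCK_LINES):
    """Compress quality lines in independent blocks.

    Returns the concatenated compressed payload and an index holding
    one (byte offset, compressed length) pair per block.
    """
    payload = bytearray()
    index = []
    for start in range(0, len(quality_lines), block_lines):
        chunk = "\n".join(quality_lines[start:start + block_lines])
        packed = zlib.compress(chunk.encode("ascii"), 9)
        index.append((len(payload), len(packed)))
        payload += packed
    return bytes(payload), index


def read_line(payload, index, line_no, block_lines=BLOCK_LINES):
    """Random access: decompress only the block that holds line_no."""
    offset, length = index[line_no // block_lines]
    block = zlib.decompress(payload[offset:offset + length]).decode("ascii")
    return block.split("\n")[line_no % block_lines]


# Usage: fetch one quality line without decompressing the whole payload.
lines = ["IIII#", "FFFF:", "!!!!!", "ABCDE", "IIIIF", "#####", "FF:FF"]
payload, index = compress_blocks(lines, block_lines=3)
fifth = read_line(payload, index, 4, block_lines=3)
```

Smaller blocks make each lookup cheaper but give the compressor less context per block, so the block size trades random access speed against compression rate.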

CONCLUSIONS

CMIC is a compressor designed specifically for quality score sequences, and it performs well in terms of compression rate, compression speed, decompression speed, and random access speed. CMIC is available at https://github.com/Humonex/Cmic.

Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/54a8/9308261/58fb0a70b86d/12859_2022_4837_Fig1_HTML.jpg
Fig. 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/54a8/9308261/77b7a2b4c5e8/12859_2022_4837_Fig2_HTML.jpg
Fig. 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/54a8/9308261/ae60e8797c0f/12859_2022_4837_Fig3_HTML.jpg

References

1. FCLQC: fast and concurrent lossless quality scores compressor. BMC Bioinformatics. 2021 Dec 20;22(1):606. doi: 10.1186/s12859-021-04516-7.
2. LCQS: an efficient lossless compression tool of quality scores with random access functionality. BMC Bioinformatics. 2020 Mar 18;21(1):109. doi: 10.1186/s12859-020-3428-7.
3. Random access in large-scale DNA data storage. Nat Biotechnol. 2018 Mar;36(3):242-248. doi: 10.1038/nbt.4079. Epub 2018 Feb 19.
4. AQUa: an adaptive framework for compression of sequencing quality scores with random access functionality. Bioinformatics. 2018 Feb 1;34(3):425-433. doi: 10.1093/bioinformatics/btx607.
5. An Evaluation Framework for Lossy Compression of Genome Sequencing Quality Values. Proc Data Compress Conf. 2016 Mar-Apr;2016:221-230. doi: 10.1109/DCC.2016.39. Epub 2016 Dec 19.
6. LW-FQZip 2: a parallelized reference-based compression of FASTQ files. BMC Bioinformatics. 2017 Mar 20;18(1):179. doi: 10.1186/s12859-017-1588-x.
7. AFRESh: an adaptive framework for compression of reads and assembled sequences with random access functionality. Bioinformatics. 2017 May 15;33(10):1464-1472. doi: 10.1093/bioinformatics/btx001.
8. CSAM: Compressed SAM format. Bioinformatics. 2016 Dec 15;32(24):3709-3716. doi: 10.1093/bioinformatics/btw543. Epub 2016 Aug 18.
9. CARGO: effective format-free compressed storage of genomic information. Nucleic Acids Res. 2016 Jul 8;44(12):e114. doi: 10.1093/nar/gkw318. Epub 2016 Apr 29.
10. LFQC: a lossless compression algorithm for FASTQ files. Bioinformatics. 2015 Oct 15;31(20):3276-81. doi: 10.1093/bioinformatics/btv384. Epub 2015 Jun 20.