
Lossy compression of quality scores in genomic data.

Affiliation

NICTA Victoria Research Laboratory, Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Publication information

Bioinformatics. 2014 Aug 1;30(15):2130-6. doi: 10.1093/bioinformatics/btu183. Epub 2014 Apr 10.

DOI: 10.1093/bioinformatics/btu183
PMID: 24728856

Abstract

MOTIVATION

Next-generation sequencing technologies are revolutionizing medicine. Data from sequencing technologies are typically represented as a string of bases, an associated sequence of per-base quality scores and other metadata, and in aggregate can require a large amount of space. The quality scores show how accurate the bases are with respect to the sequencing process, that is, how confident the sequencer is of having called them correctly, and are the largest component in datasets in which they are retained. Previous research has examined how to store sequences of bases effectively; here we add to that knowledge by examining methods for compressing quality scores. The quality values originate in a continuous domain, and so if a fidelity criterion is introduced, it is possible to introduce flexibility in the way these values are represented, allowing lossy compression over the quality score data.
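
To make the idea of a fidelity criterion concrete, here is a minimal sketch of one simple lossy scheme: uniform binning with a bounded absolute error. It is an illustration only and is not claimed to be either of the methods introduced in the paper. The function name `quantize_uniform` and the parameter `max_error` are hypothetical, and the scores are assumed to be non-negative integers, as they are commonly stored in sequencing file formats.

```python
def quantize_uniform(scores, max_error):
    """Replace each integer quality score with the centre of its bin.

    Bins have width 2 * max_error + 1, so every score lies within
    max_error of its representative. This is an illustrative scheme,
    not the paper's method.
    """
    if max_error < 0:
        raise ValueError("max_error must be non-negative")
    width = 2 * max_error + 1
    # Bin k covers [k * width, k * width + 2 * max_error]; its centre is
    # k * width + max_error.
    return [(q // width) * width + max_error for q in scores]


if __name__ == "__main__":
    original = [2, 10, 38, 40, 40, 39, 12]  # made-up example scores
    lossy = quantize_uniform(original, max_error=2)
    print(lossy)
    # The fidelity criterion holds for every position.
    assert all(abs(a - b) <= 2 for a, b in zip(original, lossy))
```

Fewer distinct values remain after binning, which is what gives a downstream entropy coder room to save space; setting `max_error=0` leaves the scores unchanged.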

RESULTS

We present existing compression options for quality score data, and then introduce two new lossy techniques. Experiments measuring the trade-off between compression ratio and information loss are reported, including quantifying the effect of lossy representations on a downstream application that carries out single nucleotide polymorphism and insert/deletion detection. The new methods are demonstrably superior to other techniques when assessed against the spectrum of possible trade-offs between storage required and fidelity of representation.
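
The trade-off between storage and fidelity can be sketched with two simple proxies: the empirical entropy of the quantized scores (a rough lower bound on bits per score for a memoryless coder) and the mean squared error against the originals. This is an assumption-laden illustration, not the evaluation protocol of the paper, which also measures the effect on variant detection. All function names below are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Empirical zeroth-order entropy in bits per symbol."""
    if not symbols:
        raise ValueError("symbols must be non-empty")
    n = len(symbols)
    counts = Counter(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mean_squared_error(original, lossy):
    """Average squared difference between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(original, lossy)) / len(original)


def bin_scores(scores, width):
    """Uniform binning: map each score to a fixed value inside its bin."""
    return [(q // width) * width + (width - 1) // 2 for q in scores]


def tradeoff(scores, widths):
    """Return (width, bits per score, MSE) for each candidate bin width."""
    rows = []
    for w in widths:
        lossy = bin_scores(scores, w)
        rows.append((w, entropy_bits(lossy), mean_squared_error(scores, lossy)))
    return rows


if __name__ == "__main__":
    scores = [2, 10, 38, 40, 40, 39, 12, 33, 35, 37]  # made-up example scores
    for width, bits, mse in tradeoff(scores, [1, 2, 4, 8]):
        print(f"width={width} bits/score={bits:.3f} mse={mse:.3f}")
```

A width of 1 is the lossless case with zero error; wider bins lower the entropy estimate at the cost of higher distortion, and plotting the rows gives one point per setting on a rate-distortion style curve.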

AVAILABILITY AND IMPLEMENTATION

An implementation of the methods described here is available at https://github.com/rcanovas/libCSAM.

CONTACT

rcanovas@student.unimelb.edu.au

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

1. Lossy compression of quality scores in genomic data.
Bioinformatics. 2014 Aug 1;30(15):2130-6. doi: 10.1093/bioinformatics/btu183. Epub 2014 Apr 10.
2. CSAM: Compressed SAM format.
Bioinformatics. 2016 Dec 15;32(24):3709-3716. doi: 10.1093/bioinformatics/btw543. Epub 2016 Aug 18.
3. FaStore: a space-saving solution for raw sequencing data.
Bioinformatics. 2018 Aug 15;34(16):2748-2756. doi: 10.1093/bioinformatics/bty205.
4. Performance evaluation of lossy quality compression algorithms for RNA-seq data.
BMC Bioinformatics. 2020 Jul 20;21(1):321. doi: 10.1186/s12859-020-03658-4.
5. CALQ: compression of quality values of aligned sequencing data.
Bioinformatics. 2018 May 15;34(10):1650-1658. doi: 10.1093/bioinformatics/btx737.
6. QualComp: a new lossy compressor for quality scores based on rate distortion theory.
BMC Bioinformatics. 2013 Jun 8;14:187. doi: 10.1186/1471-2105-14-187.
7. A Two-Level Scheme for Quality Score Compression.
J Comput Biol. 2018 Oct;25(10):1141-1151. doi: 10.1089/cmb.2018.0065. Epub 2018 Jul 30.
8. ScaleQC: a scalable lossy to lossless solution for NGS data compression.
Bioinformatics. 2020 Nov 1;36(17):4551-4559. doi: 10.1093/bioinformatics/btaa543.
9. AQUa: an adaptive framework for compression of sequencing quality scores with random access functionality.
Bioinformatics. 2018 Feb 1;34(3):425-433. doi: 10.1093/bioinformatics/btx607.
10. Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.
Bioinformatics. 2018 Feb 15;34(4):558-567. doi: 10.1093/bioinformatics/btx639.

Cited by

1. PQSDC: a parallel lossless compressor for quality scores data via sequences partition and run-length prediction mapping.
Bioinformatics. 2024 May 2;40(5). doi: 10.1093/bioinformatics/btae323.
2. Efficient sequencing data compression and FPGA acceleration based on a two-step framework.
Front Genet. 2023 Sep 21;14:1260531. doi: 10.3389/fgene.2023.1260531. eCollection 2023.
3. Nanopore quality score resolution can be reduced with little effect on downstream analysis.
Bioinform Adv. 2022 Aug 11;2(1):vbac054. doi: 10.1093/bioadv/vbac054. eCollection 2022.
4. CMIC: an efficient quality score compressor with random access functionality.
BMC Bioinformatics. 2022 Jul 23;23(1):294. doi: 10.1186/s12859-022-04837-1.
5. ACO: lossless quality score compression based on adaptive coding order.
BMC Bioinformatics. 2022 Jun 7;23(1):219. doi: 10.1186/s12859-022-04712-z.
6. IonCRAM: a reference-based compression tool for ion torrent sequence files.
BMC Bioinformatics. 2020 Sep 9;21(1):397. doi: 10.1186/s12859-020-03726-9.
7. Performance evaluation of lossy quality compression algorithms for RNA-seq data.
BMC Bioinformatics. 2020 Jul 20;21(1):321. doi: 10.1186/s12859-020-03658-4.
8. Better quality score compression through sequence-based quality smoothing.
BMC Bioinformatics. 2019 Nov 22;20(Suppl 9):302. doi: 10.1186/s12859-019-2883-5.
9. Denoising of Aligned Genomic Data.
Sci Rep. 2019 Oct 21;9(1):15067. doi: 10.1038/s41598-019-51418-z.
10. Crumble: reference free lossy compression of sequence quality values.
Bioinformatics. 2019 Jan 15;35(2):337-339. doi: 10.1093/bioinformatics/bty608.