Denoising of Quality Scores for Boosted Inference and Reduced Storage.

Authors

Ochoa Idoia, Hernaez Mikel, Goldfeder Rachel, Weissman Tsachy, Ashley Euan

Affiliations

Department of Electrical Engineering, Stanford University, Stanford, CA, 94305.

Department of Medicine, Stanford University, Stanford, CA, 94305.

Publication information

Proc Data Compress Conf. 2016 Mar-Apr;2016:251-260. doi: 10.1109/DCC.2016.92. Epub 2016 Dec 19.

DOI:10.1109/DCC.2016.92
PMID:29098178
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5663231/
Abstract

Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in the sequencing cost. Much of the raw data are comprised of nucleotides and the corresponding quality scores that indicate their reliability. The latter are more difficult to compress and are themselves noisy. Lossless and lossy compression of the quality scores has recently been proposed to alleviate the storage costs, but reducing the noise in the quality scores has remained largely unexplored. This raw data is processed in order to identify variants; these genetic variants are used in important applications, such as medical decision making. Thus improving the performance of the variant calling by reducing the noise contained in the quality scores is important. We propose a denoising scheme that reduces the noise of the quality scores and we demonstrate improved inference with this denoised data. Specifically, we show that replacing the quality scores with those generated by the proposed denoiser results in more accurate variant calling in general. Moreover, a consequence of the denoising is that the entropy of the produced quality scores is smaller, and thus significant compression can be achieved with respect to lossless compression of the original quality scores. We expect our results to provide a baseline for future research in denoising of quality scores. The code used in this work as well as a Supplement with all the results are available at http://web.stanford.edu/~iochoa/DCCdenoiser_CodeAndSupplement.zip.
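
The link the abstract draws between denoising and storage rests on entropy: a sequence of quality scores with a less spread-out empirical distribution needs fewer bits per symbol under lossless coding. The sketch below only illustrates that relationship and is not the denoiser proposed in the paper. It assumes the common Phred convention (Q = -10 log10 p) with an ASCII offset of 33, and `toy_smooth` is a hypothetical stand-in that snaps scores to bin midpoints; the example quality string is made up.

```python
import math
from collections import Counter
from typing import Iterable, List


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score to the implied base-call error probability."""
    return 10 ** (-q / 10)


def decode_quality_string(qual: str, offset: int = 33) -> List[int]:
    """Decode an ASCII-encoded quality string (offset 33 is an assumption here)."""
    return [ord(c) - offset for c in qual]


def empirical_entropy(symbols: Iterable[int]) -> float:
    """Zeroth-order empirical entropy in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def toy_smooth(scores: List[int], bin_size: int = 10) -> List[int]:
    """Hypothetical stand-in for a denoiser: snap each score to its bin midpoint.

    This is plain quantization, used only to show how a narrower score
    alphabet lowers entropy. It is not the scheme described in the paper.
    """
    return [(q // bin_size) * bin_size + bin_size // 2 for q in scores]


# Made-up quality string for illustration only.
original = decode_quality_string("IIIHGF@@?>IIHHGG#5")
smoothed = toy_smooth(original)

h_orig = empirical_entropy(original)
h_smooth = empirical_entropy(smoothed)

print(f"original: {h_orig:.3f} bits/score")
print(f"smoothed: {h_smooth:.3f} bits/score")
```

Because the smoothing step maps several distinct scores to one value, the empirical entropy of its output cannot exceed that of its input. Whether the altered scores also improve variant calling is a separate question that the paper evaluates empirically.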
