Data-dependent bucketing improves reference-free compression of sequencing reads.

Author information

Patro Rob, Kingsford Carl

Affiliations

Department of Computer Science, Stony Brook University, Stony Brook, NY 11794-4400, USA.

Department of Computational Biology, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA.

Publication information

Bioinformatics. 2015 Sep 1;31(17):2770-7. doi: 10.1093/bioinformatics/btv248. Epub 2015 Apr 24.

DOI:10.1093/bioinformatics/btv248

PMID:25910696

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4547610/

Abstract

MOTIVATION

The storage and transmission of high-throughput sequencing data consume significant resources. As our capacity to produce such data continues to increase, this burden will only grow. One approach to reducing storage and transmission requirements is to compress the sequencing data.

RESULTS

We present a novel technique to boost the compression of sequencing data that is based on the concept of bucketing similar reads so that they appear nearby in the file. We demonstrate that, by adopting a data-dependent bucketing scheme and employing a number of encoding ideas, we can achieve substantially better compression ratios than existing de novo sequence compression tools, including other bucketing and reordering schemes. Our method, Mince, achieves up to a 45% reduction in file sizes (28% on average) compared with existing state-of-the-art de novo compression schemes.
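The general idea of bucketing can be illustrated with a toy sketch. This is not Mince's algorithm: the bucket label used here (the lexicographically smallest k-mer of each read), the value of k, and the function names `smallest_kmer` and `bucket_reads` are all assumptions chosen for illustration. The sketch only shows how reads that share a subsequence can be reordered so that they sit next to each other in the output.

```python
from collections import defaultdict


def smallest_kmer(read: str, k: int = 4) -> str:
    """Return the lexicographically smallest k-mer of a read.

    Hypothetical bucket label for illustration only; assumes len(read) >= k.
    """
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_reads(reads, k: int = 4):
    """Group reads by bucket label, then emit them bucket by bucket.

    Reads that share a label end up adjacent in the returned list,
    which is the property a downstream compressor could exploit.
    """
    buckets = defaultdict(list)
    for read in reads:
        buckets[smallest_kmer(read, k)].append(read)

    ordered = []
    for label in sorted(buckets):
        ordered.extend(buckets[label])
    return ordered


reads = ["ACGTTTGA", "TTTTCCCC", "GGGACGTT", "CCCCTTTT"]
reordered = bucket_reads(reads)
# The two reads containing ACGT are now adjacent, as are the two containing CCCC.
print(reordered)
```

The reordering is a permutation of the input, so no read is lost; only the original read order is discarded, which is acceptable when that order carries no information.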

AVAILABILITY AND IMPLEMENTATION

Mince is written in C++11, is open source and has been made available under the GPLv3 license. It is available at http://www.cs.cmu.edu/~ckingsf/software/mince.

CONTACT

carlk@cs.cmu.edu

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
