Compression of next-generation sequencing reads aided by highly efficient de novo assembly.

Affiliation

Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195-2350, USA.

Publication details

Nucleic Acids Res. 2012 Dec;40(22):e171. doi: 10.1093/nar/gks754. Epub 2012 Aug 16.

DOI:10.1093/nar/gks754
PMID:22904078
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3526293/
Abstract

We present Quip, a lossless compression algorithm for next-generation sequencing data in the FASTQ and SAM/BAM formats. In addition to implementing reference-based compression, we have developed, to our knowledge, the first assembly-based compressor, using a novel de novo assembly algorithm. A probabilistic data structure is used to dramatically reduce the memory required by traditional de Bruijn graph assemblers, allowing millions of reads to be assembled very efficiently. Read sequences are then stored as positions within the assembled contigs. This is combined with statistical compression of read identifiers, quality scores, alignment information and sequences, effectively collapsing very large data sets to <15% of their original size with no loss of information.
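
The two ideas in the abstract, a probabilistic structure for k-mer membership and reads stored as positions within contigs, can be illustrated with a minimal sketch. It is not Quip's implementation: the abstract does not name the probabilistic data structure, so a plain Bloom filter stands in for it here, and `BloomFilter`, `kmers`, `encode_read`, and `decode_read` are hypothetical names. The sketch only handles reads that match a contig exactly and stores every other read verbatim, which keeps the round trip lossless.

```python
import hashlib


class BloomFilter:
    """Approximate set: never misses an added item, may report false positives."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting a single hash function.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def kmers(seq, k):
    """All overlapping substrings of length k, i.e. de Bruijn graph nodes."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def encode_read(read, contigs):
    """Store a read as (contig index, offset, length) when it occurs verbatim."""
    for i, contig in enumerate(contigs):
        offset = contig.find(read)
        if offset != -1:
            return ("pos", i, offset, len(read))
    return ("raw", read)  # fallback keeps the scheme lossless


def decode_read(record, contigs):
    """Invert encode_read."""
    if record[0] == "pos":
        _, i, offset, length = record
        return contigs[i][offset:offset + length]
    return record[1]


# Toy data (assumed for illustration only).
contigs = ["ACGTACGTTAGC"]
reads = ["GTACG", "TTAGC", "GGGGG"]
records = [encode_read(r, contigs) for r in reads]
decoded = [decode_read(rec, contigs) for rec in records]
```

A read that lies inside a contig shrinks to three small integers, while the Bloom filter answers "has this k-mer been seen?" in a fixed number of bits regardless of how many k-mers are inserted, at the cost of occasional false positives.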

AVAILABILITY

Quip is freely available under the 3-clause BSD license from http://cs.washington.edu/homes/dcjones/quip.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dae9/3526293/55dd683960a7/gks754f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dae9/3526293/f78875d98b7f/gks754f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dae9/3526293/c9c6af745b80/gks754f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dae9/3526293/206ac194b25e/gks754f4.jpg