LW-FQZip 2: a parallelized reference-based compression of FASTQ files.

Authors

Huang Zhi-An, Wen Zhenkun, Deng Qingjin, Chu Ying, Sun Yiwen, Zhu Zexuan

Affiliations

College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060, China.

School of Medicine, Shenzhen University, Shenzhen, 518060, China.

Publication information

BMC Bioinformatics. 2017 Mar 20;18(1):179. doi: 10.1186/s12859-017-1588-x.

DOI:10.1186/s12859-017-1588-x
PMID:28320326
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5359991/
Abstract

BACKGROUND

Rapid progress in high-throughput DNA sequencing techniques has dramatically reduced the cost of whole-genome sequencing, leading to revolutionary advances in the gene industry. The volume of raw data is growing faster than disk costs are falling, and the storage of huge sequencing datasets has become a bottleneck for downstream analyses. Data compression is considered one way to reduce the dependency on storage, so efficient compression methods for sequencing data are in high demand.

RESULTS

In this article, we present LW-FQZip 2, a lossless reference-based compression method for FASTQ files. LW-FQZip 2 improves on LW-FQZip 1 by introducing a more efficient coding scheme and parallelism. In particular, it combines a light-weight mapping model, a bitwise prediction by partial matching (PPM) model, arithmetic coding, and multi-threaded parallelism. LW-FQZip 2 is evaluated on both short-read and long-read data generated by various sequencing platforms. The experimental results show that it obtains promising compression ratios at reasonable time and memory costs.
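
The abstract does not spell out how reads are mapped or encoded, so the following is only a toy sketch of the general idea behind reference-based read compression: a read that aligns to the reference can be stored as a position plus a short list of mismatches instead of its full base string. It is not the LW-FQZip 2 implementation. The function names (`build_index`, `encode_read`, `decode_read`), the k-mer length `K`, and the `max_mismatches` threshold are all hypothetical, and the sketch leaves out the PPM modelling, arithmetic coding, quality scores, read identifiers, and multi-threading mentioned above.

```python
from typing import Dict, List, Optional, Tuple

K = 4  # toy k-mer length; an assumption for illustration only

Encoded = Tuple[int, List[Tuple[int, str]]]  # (reference position, mismatches)


def build_index(reference: str, k: int = K) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it occurs."""
    index: Dict[str, List[int]] = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def encode_read(read: str, reference: str, index: Dict[str, List[int]],
                k: int = K, max_mismatches: int = 2) -> Optional[Encoded]:
    """Seed with the read's first k-mer, then keep the candidate position
    with the fewest substitutions. Returns None if nothing maps well enough
    (such reads would have to be stored verbatim)."""
    best: Optional[Encoded] = None
    for pos in index.get(read[:k], []):
        if pos + len(read) > len(reference):
            continue  # read would run past the end of the reference
        diffs = [(i, base) for i, base in enumerate(read)
                 if reference[pos + i] != base]
        if len(diffs) <= max_mismatches and (
                best is None or len(diffs) < len(best[1])):
            best = (pos, diffs)
    return best


def decode_read(pos: int, length: int, diffs: List[Tuple[int, str]],
                reference: str) -> str:
    """Rebuild the read from the reference slice plus recorded mismatches."""
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGGCTA"
    read = "CGTTAGTCGA"  # reference[5:15] with one substitution at offset 6
    index = build_index(reference)
    encoded = encode_read(read, reference, index)
    print(encoded)  # (5, [(6, 'T')])
    pos, diffs = encoded
    assert decode_read(pos, len(read), diffs, reference) == read
```

In this toy setting the ten-base read is reduced to one integer and one (offset, base) pair, and decoding recovers it exactly, which is the lossless property the abstract refers to. A practical compressor would then pass such positions and mismatches to an entropy coder.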

CONCLUSIONS

This performance makes LW-FQZip 2 a candidate tool for archival or space-sensitive applications of high-throughput DNA sequencing data. LW-FQZip 2 is freely available at http://csse.szu.edu.cn/staff/zhuzx/LWFQZip2 and https://github.com/Zhuzxlab/LW-FQZip2 .

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8a67/5359991/71b7a44df88a/12859_2017_1588_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8a67/5359991/56d9ae64aeb9/12859_2017_1588_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8a67/5359991/84a2511bef1d/12859_2017_1588_Fig3_HTML.jpg
