A benchmark study of compression software for human short-read sequence data.

Authors

Betschart Raphael O, Thalén Felix, Blankenberg Stefan, Zoche Martin, Zeller Tanja, Ziegler Andreas

Affiliations

Cardio-CARE, Medizincampus Davos, Herman-Burchard-Str. 12, Davos Wolfgang, 7265, Davos, Switzerland.

Institute of Cardiogenetics, University of Lübeck, Lübeck, Germany.

Publication information

Sci Rep. 2025 May 2;15(1):15358. doi: 10.1038/s41598-025-00491-8.

DOI:10.1038/s41598-025-00491-8
PMID:40316539
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12048562/
Abstract

Efficient data compression technologies are crucial to reduce the cost of long-term storage and file transfer in whole genome sequencing studies. This study benchmarked four specialized compression tools developed for paired-end fastq.gz files: DRAGEN ORA 4.3.4 (ORA), Genozip 15.0.62, repaq 0.3.0, and SPRING 1.1.1. The benchmark used three subjects from the Genome in a Bottle Consortium that were sequenced 82 times on an Illumina NovaSeq 6000, with an average coverage of 35x. The study additionally compared Genozip with SAMtools 1.20 for the compression of BAM files. All tools provided lossless compression. ORA and Genozip achieved compression ratios of approximately 1:6 when compressing fastq.gz files. repaq and SPRING had lower compression ratios of 1:2 and 1:4, respectively, and took longer than ORA and Genozip for both compression and decompression. Genozip had approximately 16% higher compression for BAM files than SAMtools. However, the BAM compression of SAMtools produces CRAM files, which are compatible with many software packages. ORA, repaq, and SPRING are limited to compressing fastq.gz files, while Genozip supports various file formats. Although Genozip requires an annual license, its source code is freely available, ensuring sustainability. In conclusion, paired-end short-read sequence data can be efficiently compressed using specialized compression software. Commercial tools offer higher compression ratios than freely available software.
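To make the reported figures concrete, the following is a minimal Python sketch of how a compression ratio such as 1:6 relates to storage saved, and of how a lossless round trip can be checked by comparing checksums. It is an illustration only and not the authors' benchmarking procedure: the function names are hypothetical, the FASTQ record is toy data, and the standard-library gzip module stands in for the specialized tools, whose command-line interfaces are not described in the abstract.

```python
import gzip
import hashlib


def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Return how many times smaller the output is (6.0 corresponds to 1:6)."""
    if original_bytes <= 0 or compressed_bytes <= 0:
        raise ValueError("sizes must be positive")
    return original_bytes / compressed_bytes


def space_saving(original_bytes: int, compressed_bytes: int) -> float:
    """Return the fraction of storage saved relative to the input size."""
    if original_bytes <= 0 or compressed_bytes <= 0:
        raise ValueError("sizes must be positive")
    return 1.0 - compressed_bytes / original_bytes


def is_lossless(original: bytes, restored: bytes) -> bool:
    """Compare SHA-256 digests of the input and the decompressed output."""
    return hashlib.sha256(original).digest() == hashlib.sha256(restored).digest()


# Toy FASTQ record repeated many times (assumption: not real sequencing data).
record = b"@read1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n"
data = record * 1000

# gzip is only a stand-in compressor for this illustration.
packed = gzip.compress(data)
restored = gzip.decompress(packed)

ratio = compression_ratio(len(data), len(packed))
saving = space_saving(len(data), len(packed))
print(f"ratio 1:{ratio:.1f}, saving {saving:.1%}")
print("lossless:", is_lossless(data, restored))
```

Under this definition, the ratios of 1:2, 1:4, and 1:6 quoted in the abstract correspond to savings of 50%, 75%, and roughly 83% relative to the fastq.gz input. The toy data is highly repetitive, so the ratio printed here says nothing about real sequencing files.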


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3aa2/12048562/77d29cb2ae74/41598_2025_491_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3aa2/12048562/8dee91e1f8c9/41598_2025_491_Fig2_HTML.jpg

