A new efficient referential genome compression technique for FastQ files.

Affiliations

United University, Prayagraj, Uttar Pradesh, 211012, India.

School of Computer Science Engineering and Technology, Bennett University, Greater Noida, Uttar Pradesh, 201310, India.

Publication information

Funct Integr Genomics. 2023 Nov 11;23(4):333. doi: 10.1007/s10142-023-01259-x.

DOI: 10.1007/s10142-023-01259-x
PMID: 37950100

Abstract

Hospitals and medical laboratories generate a tremendous amount of genome sequence data every day for use in research, surgery, and disease diagnosis. Compression is therefore essential for storing, monitoring, and distributing these data. A new data compression technique is needed to reduce the time and cost of storage, transmission, and data processing. General-purpose compression techniques perform poorly on these data because of their special features: a large number of repeats (tandem and palindromic), a small alphabet, highly similar sequences, and specific file formats. In this study, we present a method for compressing FastQ files that relies on a reference genome without sacrificing data quality. FastQ files are first split into three streams (identifier, sequence, and quality score), each of which receives its own compression technique. A novel, quick, and lightweight mapping mechanism is also presented to compress the sequence stream effectively. Experiments show that both the compression ratio and the compression/decompression time of NGS data compressed with the proposed method, RBFQC, are superior to those achieved by other state-of-the-art genome compression methods. In comparison to GZIP, RBFQC may achieve a compression ratio of 80-140% for fixed-length datasets and 80-125% for variable-length datasets. Compared to domain-specific referential genome compression techniques for FastQ files, RBFQC improves total compression and decompression speed by 10-25%.
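The abstract gives no implementation details for RBFQC, so the following is only a minimal Python sketch of the two general ideas it names: splitting a FastQ file into identifier, sequence, and quality streams, and encoding a read against a reference genome. Every function name here (`split_fastq`, `encode_read`, `decode_read`, `compress_streams`) is hypothetical. The seed-based lookup and the use of `zlib` for each stream are stand-ins chosen for illustration, not the paper's mapping mechanism or its per-stream compressors. The sketch also assumes four-line records with a bare `+` separator line.

```python
import zlib


def split_fastq(text):
    """Split FASTQ text into identifier, sequence, and quality streams.

    Assumes exactly four lines per record (no wrapped sequences).
    """
    lines = text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    ids, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        ids.append(header[1:])  # drop the leading '@'
        seqs.append(seq)
        quals.append(qual)
    return ids, seqs, quals


def encode_read(read, reference, seed_len=8):
    """Illustrative referential encoding (not the paper's mapper).

    Looks up the first `seed_len` bases of the read in the reference and,
    if found, returns (position, [(offset, base), ...]) where the list
    holds the bases that differ from the reference. Returns None when the
    read cannot be placed, in which case it would be stored literally.
    """
    pos = reference.find(read[:seed_len])
    if pos < 0 or pos + len(read) > len(reference):
        return None
    window = reference[pos:pos + len(read)]
    mismatches = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return pos, mismatches


def decode_read(encoded, reference, length):
    """Rebuild a read from its (position, mismatches) encoding."""
    pos, mismatches = encoded
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


def compress_streams(ids, seqs, quals):
    """Compress each stream separately; zlib is a placeholder codec."""
    return tuple(
        zlib.compress("\n".join(stream).encode("ascii"))
        for stream in (ids, seqs, quals)
    )
```

Keeping the streams separate lets each one be handled by a codec suited to its statistics, and a mapped read shrinks to a position plus a short mismatch list. A real referential compressor would also need to handle unmapped reads, insertions and deletions, and reverse-complement matches, none of which this sketch attempts.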

Similar articles

1. A new efficient referential genome compression technique for FastQ files.
   Funct Integr Genomics. 2023 Nov 11;23(4):333. doi: 10.1007/s10142-023-01259-x.
2. WBFQC: A new approach for compressing next-generation sequencing data splitting into homogeneous streams.
   J Bioinform Comput Biol. 2018 Oct;16(5):1850018. doi: 10.1142/S021972001850018X. Epub 2018 Jun 28.
3. SCALCE: boosting sequence compression algorithms using locally consistent encoding.
   Bioinformatics. 2012 Dec 1;28(23):3051-7. doi: 10.1093/bioinformatics/bts593. Epub 2012 Oct 9.
4. LFQC: a lossless compression algorithm for FASTQ files.
   Bioinformatics. 2015 Oct 15;31(20):3276-81. doi: 10.1093/bioinformatics/btv384. Epub 2015 Jun 20.
5. Light-weight reference-based compression of FASTQ data.
   BMC Bioinformatics. 2015 Jun 9;16(1):188. doi: 10.1186/s12859-015-0628-7.
6. FQC: A novel approach for efficient compression, archival, and dissemination of fastq datasets.
   J Bioinform Comput Biol. 2015 Jun;13(3):1541003. doi: 10.1142/S0219720015410036. Epub 2015 Feb 8.
7. CIndex: compressed indexes for fast retrieval of FASTQ files.
   Bioinformatics. 2022 Jan 3;38(2):335-343. doi: 10.1093/bioinformatics/btab655.
8. Compression of next-generation sequencing quality scores using memetic algorithm.
   BMC Bioinformatics. 2014;15 Suppl 15(Suppl 15):S10. doi: 10.1186/1471-2105-15-S15-S10. Epub 2014 Dec 3.
9. BEETL-fastq: a searchable compressed archive for DNA reads.
   Bioinformatics. 2014 Oct;30(19):2796-801. doi: 10.1093/bioinformatics/btu387. Epub 2014 Jun 20.
10. PQSDC: a parallel lossless compressor for quality scores data via sequences partition and run-length prediction mapping.
    Bioinformatics. 2024 May 2;40(5). doi: 10.1093/bioinformatics/btae323.

Cited by

1. DeepSplice: a deep learning approach for accurate prediction of alternative splicing events in the human genome.
   Front Genet. 2024 Jun 21;15:1349546. doi: 10.3389/fgene.2024.1349546. eCollection 2024.