Tackling the Challenges of FASTQ Referential Compression.

Authors

Guerra Aníbal, Lotero Jaime, Aedo José Édinson, Isaza Sebastián

Affiliations

Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela.

Facultad de Ingeniería, Universidad de Antioquia (UdeA), Medellín, Colombia.

Publication information

Bioinform Biol Insights. 2019 Feb 14;13:1177932218821373. doi: 10.1177/1177932218821373. eCollection 2019.

DOI:10.1177/1177932218821373
PMID:30792576
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6376532/
Abstract

The exponential growth of genomic data has recently motivated the development of compression algorithms to tackle the storage capacity limitations in bioinformatics centers. Referential compressors could theoretically achieve much higher compression than their non-referential counterparts; however, the latest tools have not yet been able to harness this potential. To reach this goal, an efficient encoding model to represent the differences between the input and the reference is needed. In this article, we introduce a novel approach for referential compression of FASTQ files. The core of our compression scheme is a referential compressor that combines local alignments with binary encoding optimized for long reads. Here we present the algorithms and performance tests developed for our read compression algorithm, named UdeACompress. Compared with the best state-of-the-art programs, our compressor achieved the best results when compressing long reads and competitive compression ratios for shorter reads. As an added value, it also showed reasonable execution times and memory consumption in comparison with similar tools.
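
The idea of representing a read by its differences from a reference can be illustrated with a small sketch. This is not UdeACompress's encoding model, which the abstract does not detail; it is a toy example that assumes the alignment position is already known, handles substitutions only (no insertions or deletions), and uses hypothetical helper names `encode_read` and `decode_read`.

```python
from typing import List, Tuple

# (offset within the read, base that replaces the reference base)
Diff = Tuple[int, str]
# (start position in the reference, read length, list of substitutions)
Record = Tuple[int, int, List[Diff]]


def encode_read(reference: str, read: str, pos: int) -> Record:
    """Encode a read relative to the reference.

    Assumption: `pos` comes from an alignment step that is not shown here,
    and the read differs from the reference only by substitutions.
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit in the reference at this position")
    diffs = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return pos, len(read), diffs


def decode_read(reference: str, record: Record) -> str:
    """Rebuild the original read from the reference and its record."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGCCGATAGGCTA"
read = "CGTTAGACGAT"  # matches reference[5:16] except for one base

record = encode_read(reference, read, 5)
restored = decode_read(reference, record)
print(record)    # (5, 11, [(6, 'A')])
print(restored)  # CGTTAGACGAT
```

In this sketch, an 11-base read is stored as a position, a length, and a single substitution, which is where the theoretical advantage of referential compression comes from. A practical compressor would additionally need to handle indels and unmapped reads, and would pack the record into a compact binary form.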

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f54/6376532/88fe5c0757bd/10.1177_1177932218821373-fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f54/6376532/17e0f8ee000f/10.1177_1177932218821373-fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f54/6376532/c8a976424c41/10.1177_1177932218821373-fig3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f54/6376532/a0a75d40dfb2/10.1177_1177932218821373-fig4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f54/6376532/446eb4fb69fa/10.1177_1177932218821373-fig5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f54/6376532/09a3f34c24ec/10.1177_1177932218821373-fig6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f54/6376532/d6f336da8ca4/10.1177_1177932218821373-fig7.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f54/6376532/173b7129ea6e/10.1177_1177932218821373-fig8.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f54/6376532/ca4f0ec6b164/10.1177_1177932218821373-fig9.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f54/6376532/8eb642db70cc/10.1177_1177932218821373-fig10.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f54/6376532/8a0a88daa3de/10.1177_1177932218821373-fig11.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f54/6376532/e8e1502de827/10.1177_1177932218821373-fig12.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f54/6376532/b215c528708e/10.1177_1177932218821373-fig13.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f54/6376532/ecb2ae0a05b0/10.1177_1177932218821373-fig14.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f54/6376532/6d2e8a7b0434/10.1177_1177932218821373-fig15.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f54/6376532/9fffff55f938/10.1177_1177932218821373-fig16.jpg

Similar articles

1
Tackling the Challenges of FASTQ Referential Compression.
Bioinform Biol Insights. 2019 Feb 14;13:1177932218821373. doi: 10.1177/1177932218821373. eCollection 2019.
2
SCALCE: boosting sequence compression algorithms using locally consistent encoding.
Bioinformatics. 2012 Dec 1;28(23):3051-7. doi: 10.1093/bioinformatics/bts593. Epub 2012 Oct 9.
3
SPRING: a next-generation compressor for FASTQ data.
Bioinformatics. 2019 Aug 1;35(15):2674-2676. doi: 10.1093/bioinformatics/bty1015.
4
PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering.
BMC Bioinformatics. 2023 Nov 30;24(1):454. doi: 10.1186/s12859-023-05566-9.
5
CURC: a CUDA-based reference-free read compressor.
Bioinformatics. 2022 Jun 13;38(12):3294-3296. doi: 10.1093/bioinformatics/btac333.
6
A new efficient referential genome compression technique for FastQ files.
Funct Integr Genomics. 2023 Nov 11;23(4):333. doi: 10.1007/s10142-023-01259-x.
7
ENANO: Encoder for NANOpore FASTQ files.
Bioinformatics. 2020 Aug 15;36(16):4506-4507. doi: 10.1093/bioinformatics/btaa551.
8
Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.
Bioinformatics. 2018 Feb 15;34(4):558-567. doi: 10.1093/bioinformatics/btx639.
9
RETRACTED: LFQC: a lossless compression algorithm for FASTQ files.
Bioinformatics. 2019 May 1;35(9):e1-e7. doi: 10.1093/bioinformatics/btu701.
10
LFQC: a lossless compression algorithm for FASTQ files.
Bioinformatics. 2015 Oct 15;31(20):3276-81. doi: 10.1093/bioinformatics/btv384. Epub 2015 Jun 20.

Cited by

1
Telemetry Data Compression Algorithm Using Balanced Recurrent Neural Network and Deep Learning.
Comput Intell Neurosci. 2022 Jan 10;2022:4886586. doi: 10.1155/2022/4886586. eCollection 2022.
