JARVIS3: an efficient encoder for genomic data.

Authors

Sousa Maria J P, Pinho Armando J, Pratas Diogo

Affiliations

Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal.

Department of Electronics, Telecommunications and Informatics (DETI), University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal.

Publication information

Bioinformatics. 2024 Nov 28;40(12). doi: 10.1093/bioinformatics/btae725.

DOI: 10.1093/bioinformatics/btae725
PMID: 39673739
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11645547/

Abstract

MOTIVATION

Large-scale genomic projects grapple with the complex challenge of reducing medium- and long-term storage space and its associated energy consumption, monetary costs, and environmental footprint.

RESULTS

We present JARVIS3, an advanced tool engineered for the efficient reference-free compression of genomic sequences. JARVIS3 introduces a pioneering approach, specifically through enhanced table memory models and probabilistic lookup-tables applied in repeat models. These optimizations are pivotal in substantially enhancing computational efficiency. JARVIS3 offers three distinct profiles: (i) rapid computation with moderate compression, (ii) a balanced trade-off between time and compression, and (iii) slower computation with significantly higher compression ratios. The implementation of JARVIS3 is rooted in the C programming language, building upon the success of its predecessor, JARVIS2. JARVIS3 shows substantial speed improvements relative to JARVIS2 while providing slightly better compression. Furthermore, we provide a versatile C/Bash implementation, facilitating the application in FASTA and FASTQ data, including the capability for parallel computation. In addition, JARVIS3 includes a mode for outputting bit information, as well as providing the Normalized Compression and bit rates, facilitating compression-based analysis. This establishes JARVIS3 as an open-source solution for genomic data compression and analysis.
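
The abstract mentions that JARVIS3 reports the Normalized Compression and bit rates for compression-based analysis. As a rough illustration of what such quantities usually mean, the sketch below computes bits per base and a normalized compression value, assuming the common definition NC(x) = C(x) / (|x| · log2|A|), where C(x) is the compressed size in bits, |x| the sequence length, and |A| the alphabet size (4 for DNA). The function names are hypothetical, and `zlib` is used purely as a stand-in compressor; this is not the JARVIS3 implementation or its command-line interface.

```python
import math
import zlib


def bits_per_base(compressed_bytes: int, n_symbols: int) -> float:
    """Average number of bits spent per input symbol (hypothetical helper)."""
    if n_symbols <= 0:
        raise ValueError("sequence must be non-empty")
    return 8 * compressed_bytes / n_symbols


def normalized_compression(compressed_bytes: int, n_symbols: int,
                           alphabet_size: int = 4) -> float:
    """Assumed definition: C(x) / (|x| * log2|A|), with C(x) in bits.

    Values near 1 suggest little compressible structure; values near 0
    suggest a highly redundant sequence.
    """
    return bits_per_base(compressed_bytes, n_symbols) / math.log2(alphabet_size)


def stand_in_compressed_size(sequence: str) -> int:
    """Compressed size in bytes using zlib as a placeholder compressor.

    A dedicated genomic encoder would normally give much smaller sizes;
    zlib also adds a few bytes of header overhead.
    """
    return len(zlib.compress(sequence.encode("ascii"), 9))


if __name__ == "__main__":
    seq = "ACGT" * 1000  # toy, highly repetitive sequence
    size = stand_in_compressed_size(seq)
    print("bits per base:", bits_per_base(size, len(seq)))
    print("normalized compression:", normalized_compression(size, len(seq)))
```

With a real encoder, the compressed size passed to these helpers would come from that tool's output file rather than from `zlib`.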

AVAILABILITY AND IMPLEMENTATION

JARVIS3 is freely available at https://github.com/cobilab/jarvis3.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1d96/11645547/1ed1b184d9ec/btae725f1.jpg
