Efficient DNA sequence compression with neural networks.

Author information

Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal.

Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal.

Publication information

Gigascience. 2020 Nov 11;9(11). doi: 10.1093/gigascience/giaa119.

DOI:10.1093/gigascience/giaa119

PMID:33179040

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7657843/

Abstract

BACKGROUND

The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of DNA sequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for DNA sequence compression. However, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA models. For this purpose, we created GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models.
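To make the modeling idea above more concrete, the following is a minimal sketch of an order-k finite-context model for DNA, the kind of building block that context-model compressors combine. The class name `ContextModel`, the smoothing parameter `alpha`, and the helper `code_length_bits` are hypothetical names chosen for illustration and are not taken from GeCo3. The sketch only estimates an ideal code length and does not run an actual arithmetic coder.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class ContextModel:
    """Illustrative order-k finite-context model (not GeCo3's implementation)."""

    def __init__(self, order, alpha=1.0):
        self.order = order
        self.alpha = alpha  # additive smoothing; an assumed estimator
        self.counts = defaultdict(lambda: [0] * len(ALPHABET))

    def _context(self, history):
        # The context is the last `order` symbols seen so far.
        return history[-self.order:] if self.order > 0 else ""

    def predict(self, history):
        """Return P(next symbol | context) in the order A, C, G, T."""
        counts = self.counts[self._context(history)]
        total = sum(counts) + self.alpha * len(ALPHABET)
        return [(c + self.alpha) / total for c in counts]

    def update(self, history, symbol):
        """Count the symbol that actually followed the current context."""
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1


def code_length_bits(sequence, model):
    """Sum of -log2 p(symbol) under the adaptive model (ideal code length)."""
    bits = 0.0
    for i, symbol in enumerate(sequence):
        history = sequence[:i]
        probs = model.predict(history)
        bits -= math.log2(probs[ALPHABET.index(symbol)])
        model.update(history, symbol)
    return bits


# Toy input: a highly repetitive sequence, chosen only for illustration.
sequence = "ACGT" * 50
bits_per_base = code_length_bits(sequence, ContextModel(order=2)) / len(sequence)
```

On this repetitive toy sequence the estimate falls well below the 2 bits per base of a naive fixed-length encoding, because the order-2 context quickly becomes fully predictive.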

FINDINGS

We benchmark GeCo3 as a reference-free DNA compressor on 5 datasets, including a balanced and comprehensive dataset of DNA sequences, the Y-chromosome and human mitogenome, 2 compilations of archaeal and virus genomes, 4 whole genomes, and 2 collections of FASTQ data of a human virome and ancient DNA. GeCo3 achieves a solid improvement in compression over the previous version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively. To test its performance as a reference-based DNA compressor, we benchmark GeCo3 on 4 datasets consisting of the pairwise compression of the chromosomes of the genomes of several primates. GeCo3 improves the compression by 12.4%, 11.7%, 10.8%, and 10.1% over the state of the art. The cost of this compression improvement is some additional computational time (1.7-3 times slower than GeCo2). The RAM use is constant, and the tool scales efficiently, independently of the sequence size. Overall, these values outperform the state of the art.
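As a reading aid for the percentages and the slowdown range above, here is a small sketch of how such figures are commonly computed. It assumes that "improvement" means the relative reduction in compressed size with respect to the baseline; the abstract does not spell out the formula, so this is an assumption. The function names and all sizes and timings below are hypothetical and are not measurements from the paper.

```python
def improvement_percent(baseline_bytes, new_bytes):
    """Relative reduction in compressed size versus a baseline, in percent."""
    return 100.0 * (baseline_bytes - new_bytes) / baseline_bytes


def slowdown_factor(baseline_seconds, new_seconds):
    """How many times slower the new tool is than the baseline."""
    return new_seconds / baseline_seconds


# Hypothetical numbers for illustration only.
gain = improvement_percent(baseline_bytes=1_000_000, new_bytes=940_000)
slowdown = slowdown_factor(baseline_seconds=60.0, new_seconds=150.0)
```

With these made-up inputs, a drop from 1,000,000 to 940,000 bytes corresponds to a 6% improvement, and a run that takes 150 s instead of 60 s is 2.5 times slower.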

CONCLUSIONS

GeCo3 is a genomic sequence compressor with a neural network mixing approach that provides additional gains over top specific genomic compressors. The proposed mixing method is portable, requiring only the probabilities of the models as inputs, providing easy adaptation to other data compressors or compression-based data analysis tools. GeCo3 is released under GPLv3 and is available for free download at https://github.com/cobilab/geco3.
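The portability claim above, that the mixer needs only the models' probabilities as inputs, can be illustrated with a minimal sketch. The code below is a single-layer softmax mixer trained online with stochastic gradient descent; this is a deliberately simplified stand-in and should not be read as GeCo3's actual network architecture or training procedure. The names `SoftmaxMixer`, `informed`, and `uniform`, the learning rate, and the two toy input models are all assumptions made for this example.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class SoftmaxMixer:
    """Single-layer mixer trained online (an assumption, not GeCo3's network)."""

    def __init__(self, n_models, n_symbols=4, learning_rate=0.5):
        self.learning_rate = learning_rate
        n_inputs = n_models * n_symbols + 1  # +1 for a constant bias input
        self.weights = [[0.0] * n_inputs for _ in range(n_symbols)]

    @staticmethod
    def _features(model_probs):
        # Flatten the per-model distributions and append the bias input.
        return [p for probs in model_probs for p in probs] + [1.0]

    def mix(self, model_probs):
        """Combine the models' distributions into one distribution."""
        x = self._features(model_probs)
        scores = [sum(w * v for w, v in zip(row, x)) for row in self.weights]
        return softmax(scores)

    def update(self, model_probs, true_index):
        """One SGD step on the cross-entropy loss of the mixed prediction."""
        x = self._features(model_probs)
        mixed = self.mix(model_probs)
        for j, row in enumerate(self.weights):
            error = mixed[j] - (1.0 if j == true_index else 0.0)
            for i, v in enumerate(x):
                row[i] -= self.learning_rate * error * v


def informed(true_index):
    # Hypothetical model that gives 0.7 to the symbol that will occur.
    probs = [0.1] * 4
    probs[true_index] = 0.7
    return probs


uniform = [0.25] * 4  # hypothetical uninformative model
mixer = SoftmaxMixer(n_models=2)
costs = []
for step in range(200):
    true_index = step % 4
    inputs = [informed(true_index), uniform]
    costs.append(-math.log2(mixer.mix(inputs)[true_index]))  # predict first
    mixer.update(inputs, true_index)                          # then learn

late_average_bits = sum(costs[100:]) / 100
```

Because the mixer sees nothing but probability vectors, the same class could sit on top of any set of models that emit a distribution over the next symbol, which is the sense in which such a mixing stage is portable. Predicting before updating mirrors what a decoder could reproduce, since it only learns from symbols it has already decoded.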

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a1e7/7657843/bfbaba1bc4fa/giaa119fig1.jpg

Similar articles

Efficient DNA sequence compression with neural networks.

Gigascience. 2020 Nov 11;9(11). doi: 10.1093/gigascience/giaa119.

SPRING: a next-generation compressor for FASTQ data.

Bioinformatics. 2019 Aug 1;35(15):2674-2676. doi: 10.1093/bioinformatics/bty1015.

AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models.

Entropy (Basel). 2021 Apr 26;23(5):530. doi: 10.3390/e23050530.

RENANO: a REference-based compressor for NANOpore FASTQ files.

Bioinformatics. 2021 Dec 11;37(24):4862-4864. doi: 10.1093/bioinformatics/btab437.

Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences.

Bioinformatics. 2019 Oct 1;35(19):3826-3828. doi: 10.1093/bioinformatics/btz144.

LFastqC: A lossless non-reference-based FASTQ compressor.

PLoS One. 2019 Nov 14;14(11):e0224806. doi: 10.1371/journal.pone.0224806. eCollection 2019.

Reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach.

Sci Rep. 2023 Feb 6;13(1):2082. doi: 10.1038/s41598-023-29267-8.

PQSDC: a parallel lossless compressor for quality scores data via sequences partition and run-length prediction mapping.

Bioinformatics. 2024 May 2;40(5). doi: 10.1093/bioinformatics/btae323.

LCQS: an efficient lossless compression tool of quality scores with random access functionality.

BMC Bioinformatics. 2020 Mar 18;21(1):109. doi: 10.1186/s12859-020-3428-7.

CURC: a CUDA-based reference-free read compressor.

Bioinformatics. 2022 Jun 13;38(12):3294-3296. doi: 10.1093/bioinformatics/btac333.

Cited by

A lossless reference-free sequence compression algorithm leveraging grammatical, statistical, and substitution rules.

Brief Funct Genomics. 2025 Jan 15;24. doi: 10.1093/bfgp/elae050.

JARVIS3: an efficient encoder for genomic data.

Bioinformatics. 2024 Nov 28;40(12). doi: 10.1093/bioinformatics/btae725.

Generating 2D Barcode for DNA Barcode Sequences.

Methods Mol Biol. 2024;2744:239-246. doi: 10.1007/978-1-0716-3581-0_15.

AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data.

Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad101. Epub 2023 Dec 13.

AGC: compact representation of assembled genomes with fast queries and updates.

Bioinformatics. 2023 Mar 1;39(3). doi: 10.1093/bioinformatics/btad097.

Deep Learning in Population Genetics.

Genome Biol Evol. 2023 Feb 3;15(2). doi: 10.1093/gbe/evad008.

DDQR (dynamic DNA QR coding): An efficient algorithm to represent DNA barcode sequences.

PLoS One. 2023 Jan 17;18(1):e0279994. doi: 10.1371/journal.pone.0279994. eCollection 2023.

The complexity landscape of viral genomes.

Gigascience. 2022 Aug 11;11. doi: 10.1093/gigascience/giac079.

Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format.

Patterns (N Y). 2022 Sep 9;3(9):100562. doi: 10.1016/j.patter.2022.100562. Epub 2022 Jul 7.

MBGC: Multiple Bacteria Genome Compressor.

Gigascience. 2022 Jan 27;11. doi: 10.1093/gigascience/giab099.

