Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal.
Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal.
Gigascience. 2020 Nov 11;9(11). doi: 10.1093/gigascience/giaa119.
The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of DNA sequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for DNA sequence compression. However, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA models. For this purpose, we created GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models.
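To make the term "context model" concrete, the following is a minimal sketch of an order-k finite-context model over the DNA alphabet, assuming simple additive smoothing. The names `ContextModel`, `code_length_bits`, and the parameter `alpha` are illustrative assumptions and are not taken from GeCo3; substitution-tolerant context models are not covered here.

```python
import math

ALPHABET = "ACGT"  # assumption: only the four canonical bases occur


class ContextModel:
    """Order-k finite-context model with additive smoothing (illustrative only)."""

    def __init__(self, order, alpha=1.0):
        self.order = order
        self.alpha = alpha
        self.counts = {}  # maps a context string to per-symbol counts

    def probs(self, context):
        """Return estimated probabilities of A, C, G, T after the given context."""
        counts = self.counts.get(context, [0] * len(ALPHABET))
        total = sum(counts) + len(ALPHABET) * self.alpha
        return [(c + self.alpha) / total for c in counts]

    def update(self, context, symbol):
        """Record that `symbol` followed `context`."""
        row = self.counts.setdefault(context, [0] * len(ALPHABET))
        row[ALPHABET.index(symbol)] += 1


def code_length_bits(seq, order, alpha=1.0):
    """Ideal adaptive code length of `seq` in bits under one order-k model."""
    model = ContextModel(order, alpha)
    bits = 0.0
    for i, symbol in enumerate(seq):
        # The context is the (up to) `order` symbols preceding position i.
        context = seq[max(0, i - order):i]
        p = model.probs(context)[ALPHABET.index(symbol)]
        bits -= math.log2(p)
        model.update(context, symbol)  # learn only after coding the symbol
    return bits
```

Calling `code_length_bits("ACGT" * 50, order=2)` gives a total well below the 2 bits per base of a uniform code, because each context in this repetitive sequence is always followed by the same base.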
We benchmark GeCo3 as a reference-free DNA compressor on 5 datasets, including a balanced and comprehensive dataset of DNA sequences, the Y-chromosome and human mitogenome, 2 compilations of archaeal and virus genomes, 4 whole genomes, and 2 collections of FASTQ data of a human virome and ancient DNA. GeCo3 achieves a solid improvement in compression over the previous version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively. To test its performance as a reference-based DNA compressor, we benchmark GeCo3 on 4 datasets constituted by the pairwise compression of the chromosomes of the genomes of several primates. GeCo3 improves the compression by 12.4%, 11.7%, 10.8%, and 10.1% over the state of the art. The cost of this compression improvement is some additional computational time (1.7-3 times slower than GeCo2). The RAM use is constant, and the tool scales efficiently, independently of the sequence size. Overall, these values outperform the state of the art.
GeCo3 is a genomic sequence compressor with a neural network mixing approach that provides additional gains over top specific genomic compressors. The proposed mixing method is portable, requiring only the probabilities of the models as inputs, providing easy adaptation to other data compressors or compression-based data analysis tools. GeCo3 is released under GPLv3 and is available for free download at https://github.com/cobilab/geco3.
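As an illustration of mixing that takes only model probabilities as inputs, the following is a minimal sketch of a single-layer softmax mixer trained online by gradient descent on the cross-entropy loss. It is not GeCo3's actual network: the class `SoftmaxMixer`, the learning rate, and the synthetic model outputs in `demo` are all assumptions made for this example.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class SoftmaxMixer:
    """Single-layer softmax mixer over model probabilities (illustrative only)."""

    def __init__(self, n_models, n_symbols=4, lr=0.5):
        self.n_in = n_models * n_symbols + 1  # +1 for a constant bias input
        self.n_out = n_symbols
        self.lr = lr
        self.w = [[0.0] * self.n_in for _ in range(n_symbols)]

    def _features(self, model_probs):
        # Concatenate every model's probability vector and append the bias.
        x = [p for probs in model_probs for p in probs]
        x.append(1.0)
        return x

    def mix(self, model_probs):
        """Return the mixed probability distribution over the symbols."""
        x = self._features(model_probs)
        z = [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]
        return softmax(z)

    def update(self, model_probs, symbol_index):
        """One online gradient step after the true symbol is known."""
        x = self._features(model_probs)
        y = self.mix(model_probs)
        for k in range(self.n_out):
            grad = y[k] - (1.0 if k == symbol_index else 0.0)
            for j in range(self.n_in):
                self.w[k][j] -= self.lr * grad * x[j]


def demo(n_steps=400):
    """Mix one informative and one uniform synthetic model; return final p(true)."""
    mixer = SoftmaxMixer(n_models=2)
    p_true = 0.25
    for i in range(n_steps):
        truth = i % 4  # synthetic symbol stream: A, C, G, T, A, ...
        informative = [0.7 if k == truth else 0.1 for k in range(4)]
        uniform = [0.25] * 4
        inputs = [informative, uniform]
        p_true = mixer.mix(inputs)[truth]  # predict before seeing the symbol
        mixer.update(inputs, truth)
    return p_true
```

In this synthetic setting the mixer starts from a uniform prediction and learns to follow the informative model, so the value returned by `demo()` ends up well above 0.25. Because the mixer sees only probability vectors, the same class could in principle sit on top of any set of models that emit per-symbol probabilities.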