FASTA/Q data compressors for MapReduce-Hadoop genomics: space and time savings made easy.

Affiliations

Dipartimento di Scienze Statistiche, Università di Roma - La Sapienza, Rome, Italy.

Dipartimento di Informatica, Università di Salerno, Fisciano, Italy.

Publication information

BMC Bioinformatics. 2021 Mar 22;22(1):144. doi: 10.1186/s12859-021-04063-1.

Abstract

BACKGROUND

Storage of genomic data is a major cost for the Life Sciences, one that is effectively addressed via specialized data compression methods. Because genomic data are produced in such abundance, Big Data technologies are also seen as the future of genomic data storage and processing, with MapReduce-Hadoop in a leading role. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop, and deploying them there is far from straightforward. This state of the art is problematic.

RESULTS

We provide major advances in two different directions. Methodologically, we propose two general methods, with the corresponding software, that make it very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the distributed Hadoop File System, while requiring very little knowledge of Hadoop. Practically, we provide evidence that deploying those specialized compressors within Hadoop, where they have not been available so far, yields better space savings and even better execution times over compressed data than the generic compressors available in Hadoop, in particular for FASTQ files. Finally, we observe that these results also hold for the Apache Spark framework when it is used to process FASTA/Q files stored on the Hadoop File System.

CONCLUSIONS

Our methods and the corresponding software contribute substantially to achieving space and time savings for the storage and processing of FASTA/Q files in Hadoop and Spark. Because our approach is general, it can very likely also be applied to FASTA/Q compression methods that appear in the future.

AVAILABILITY

The software and the datasets are available at https://github.com/fpalini/fastdoopc.
