FASTA/Q data compressors for MapReduce-Hadoop genomics: space and time savings made easy.

Affiliations

Dipartimento di Scienze Statistiche, Università di Roma - La Sapienza, Rome, Italy.

Dipartimento di Informatica, Università di Salerno, Fisciano, Italy.

Publication information

BMC Bioinformatics. 2021 Mar 22;22(1):144. doi: 10.1186/s12859-021-04063-1.

DOI: 10.1186/s12859-021-04063-1
PMID: 33752596
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7986029/

Abstract

BACKGROUND

Storage of genomic data is a major cost for the life sciences, one that specialized data compression methods address effectively. Because data production is equally abundant, Big Data technologies are seen as the future of genomic data storage and processing, with MapReduce-Hadoop as the leading platform. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop, and deploying them there is not straightforward. This state of the art is problematic.

RESULTS

We provide major advances in two directions. Methodologically, we propose two general methods, with the corresponding software, that make it very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the Hadoop Distributed File System, even with very little knowledge of Hadoop. Practically, we provide evidence that deploying those specialized compressors within Hadoop, which has not been possible so far, yields better space savings and even better execution times over compressed data than the generic compressors available in Hadoop, in particular for FASTQ files. Finally, we observe that these results also hold for the Apache Spark framework when it is used to process FASTA/Q files stored on the Hadoop File System.
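
To give a rough sense of why a format-aware compressor can save more space than a generic one on FASTA/Q data, here is a minimal Python sketch. It is not the software described in this abstract and it does not involve Hadoop or Spark; the function names `split_fastq`, `pack_bases`, and `unpack_bases` are hypothetical, chosen only for illustration. The sketch assumes four-line FASTQ records and sequences made up solely of A, C, G, and T. Real specialized compressors rely on much more sophisticated models.

```python
# Toy illustration only: hypothetical helpers, not the API of any real tool.
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"


def split_fastq(text: str):
    """Split FASTQ text into three streams: identifiers, bases, qualities.

    Assumes every record spans exactly four lines (id, bases, '+', qualities).
    """
    lines = text.strip().split("\n")
    ids = lines[0::4]
    seqs = lines[1::4]
    quals = lines[3::4]
    return ids, seqs, quals


def pack_bases(seq: str) -> bytes:
    """Pack an A/C/G/T string into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | _CODE[base]
        # Left-align a short final chunk so unpacking reads from the high bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_bases(data: bytes, n: int) -> str:
    """Inverse of pack_bases; n is the original number of bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(_BASE[(byte >> shift) & 3])
    return "".join(bases[:n])


if __name__ == "__main__":
    sample = "@r1\nACGTA\n+\nIIIII\n@r2\nGGCA\n+\nHHHH\n"
    ids, seqs, quals = split_fastq(sample)
    for seq in seqs:
        packed = pack_bases(seq)
        assert unpack_bases(packed, len(seq)) == seq
        print(seq, len(seq), "bases ->", len(packed), "bytes")
```

Separating the three streams lets each one be encoded with a scheme suited to its alphabet. The base stream alone shrinks to a quarter of its text size here, before any statistical modelling is applied, which a general-purpose byte-oriented compressor has no built-in way to exploit.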

CONCLUSIONS

Our methods and the corresponding software contribute substantially to achieving space and time savings for the storage and processing of FASTA/Q files in Hadoop and Spark. Because our approach is general, it is very likely that it can also be applied to FASTA/Q compression methods that appear in the future.

AVAILABILITY

The software and the datasets are available at https://github.com/fpalini/fastdoopc.
