A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data.

Authors

Siretskiy Alexey, Sundqvist Tore, Voznesenskiy Mikhail, Spjuth Ola

Affiliations

Department of Information Technology, Uppsala University, P.O. Box 337, Uppsala, SE-75105 Sweden.

Department of Physical Chemistry, Institute of Chemistry, St-Petersburg State University, Saint-Petersburg, Russia.

Publication information

Gigascience. 2015 Jun 4;4:26. doi: 10.1186/s13742-015-0058-5. eCollection 2015.

DOI:10.1186/s13742-015-0058-5
PMID:26045962
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4455317/
Abstract

BACKGROUND

New high-throughput technologies, such as massively parallel sequencing, have transformed the life sciences into a data-intensive field. The most common e-infrastructure for analyzing this data consists of batch systems that are based on high-performance computing resources; however, the bioinformatics software that is built on this platform does not scale well in the general case. Recently, the Hadoop platform has emerged as an interesting option to address the challenges of increasingly large datasets with distributed storage, distributed processing, built-in data locality, fault tolerance, and an appealing programming methodology.

RESULTS

In this work we introduce metrics and report on a quantitative comparison between Hadoop and a single node of conventional high-performance computing resources for the tasks of short read mapping and variant calling. We calculate efficiency as a function of data size and observe that the Hadoop platform is more efficient for biologically relevant data sizes in terms of computing hours for both split and un-split data files. We also quantify the advantages of the data locality provided by Hadoop for NGS problems, and show that a classical architecture with network-attached storage will not scale when computing resources increase in numbers. Measurements were performed using ten datasets of different sizes, up to 100 gigabases, using the pipeline implemented in Crossbow. To make a fair comparison, we implemented an improved preprocessor for Hadoop with better performance for splittable data files. For improved usability, we implemented a graphical user interface for Crossbow in a private cloud environment using the CloudGene platform. All of the code and data in this study are freely available as open source in public repositories.

CONCLUSIONS

From our experiments we can conclude that the improved Hadoop pipeline scales better than the same pipeline on high-performance computing resources. We also conclude that Hadoop is an economically viable option for the common data sizes that are currently used in massively parallel sequencing. Given that datasets are expected to grow over time, we envision that Hadoop will play an increasingly important role in future biological data analysis.
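
The comparison "in terms of computing hours" described in the Results can be illustrated with a small sketch. The abstract does not spell out the metric, so the definition below (core-hours = wall-clock hours × occupied cores, compared as a Hadoop/HPC ratio per dataset size) is an assumption, and every number in it is an invented placeholder rather than a measurement from the study.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One pipeline run. All values used below are hypothetical placeholders."""
    gigabases: float   # dataset size
    wall_hours: float  # elapsed wall-clock time
    cores: int         # cores occupied for the whole run

    @property
    def core_hours(self) -> float:
        # Assumed cost measure: wall-clock time multiplied by occupied cores.
        return self.wall_hours * self.cores


def core_hour_ratio(hadoop: Run, hpc: Run) -> float:
    """Hadoop core-hours divided by HPC core-hours; below 1 favors Hadoop."""
    if hadoop.gigabases != hpc.gigabases:
        raise ValueError("compare runs on the same dataset size")
    return hadoop.core_hours / hpc.core_hours


# Invented example pairs (Hadoop run, single-node HPC run), not data from the paper.
pairs = [
    (Run(10, 2.0, 56), Run(10, 8.0, 16)),
    (Run(100, 12.0, 56), Run(100, 80.0, 16)),
]

for hadoop, hpc in pairs:
    ratio = core_hour_ratio(hadoop, hpc)
    print(f"{hadoop.gigabases:>5.0f} Gbases: ratio = {ratio:.3f}")
```

Plotting such a ratio against dataset size is one way to read "efficiency as a function of data size": a curve that falls below 1 as the data grows would correspond to the scaling advantage the authors report.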

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a129/4455317/c220dcba190a/13742_2015_58_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a129/4455317/ac11dde32cfd/13742_2015_58_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a129/4455317/12b1a797c2f4/13742_2015_58_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a129/4455317/7e664bdb8f7a/13742_2015_58_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a129/4455317/6fa3813e5624/13742_2015_58_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a129/4455317/03826c76e5d0/13742_2015_58_Fig6_HTML.jpg
