SparkBLAST: scalable BLAST processing using in-memory operations.

Author information

de Castro Marcelo Rodrigo, Tostes Catherine Dos Santos, Dávila Alberto M R, Senger Hermes, da Silva Fabricio A B

Affiliations

Computer Science Department, Federal University of São Carlos, Rod. Washington Luís, Km 235, São Carlos, 21040-900, Brazil.

LBCS-IOC, Oswaldo Cruz Foundation, Av Brasil 4365, Rio de Janeiro, 21040-900, Brazil.

Publication information

BMC Bioinformatics. 2017 Jun 27;18(1):318. doi: 10.1186/s12859-017-1723-8.

DOI:10.1186/s12859-017-1723-8

PMID:28655296

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5488373/

Abstract

BACKGROUND

The demand for processing ever-increasing amounts of genomic data has raised new challenges for the implementation of highly scalable and efficient computational systems. In this paper we propose SparkBLAST, a parallelization of a sequence alignment application (BLAST) that employs cloud computing for the provisioning of computational resources and Apache Spark as the coordination framework. As a proof of concept, several radionuclide-resistant bacterial genomes were selected for similarity analysis.

RESULTS

Experiments in Google and Microsoft Azure clouds demonstrated that SparkBLAST outperforms an equivalent system implemented on Hadoop in terms of speedup and execution times.
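
To make the comparison metric concrete, the following is a minimal sketch of how speedup and parallel efficiency are conventionally computed from measured execution times. The function names and the timing values are hypothetical illustrations, not measurements or code from the paper.

```python
def speedup(t_baseline: float, t_parallel: float) -> float:
    """Conventional speedup: baseline execution time divided by parallel execution time."""
    if t_baseline <= 0 or t_parallel <= 0:
        raise ValueError("execution times must be positive")
    return t_baseline / t_parallel


def efficiency(t_baseline: float, t_parallel: float, workers: int) -> float:
    """Parallel efficiency: speedup divided by the number of workers."""
    if workers <= 0:
        raise ValueError("workers must be a positive integer")
    return speedup(t_baseline, t_parallel) / workers


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds, keyed by number of worker nodes.
    # These values are made up for illustration only.
    timings = {1: 3600.0, 2: 1900.0, 4: 1000.0, 8: 600.0}
    baseline = timings[1]
    for workers, seconds in sorted(timings.items()):
        s = speedup(baseline, seconds)
        e = efficiency(baseline, seconds, workers)
        print(f"{workers} workers: speedup={s:.2f}, efficiency={e:.2f}")
```

An efficiency close to 1 would indicate near-linear scaling; values well below 1 suggest that coordination or I/O overhead is absorbing part of the added resources.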

CONCLUSIONS

The superior performance of SparkBLAST is mainly due to the in-memory operations available through the Spark framework, which reduce the number of local I/O operations required for distributed BLAST processing.

Fig. 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1ce7/5488373/e86bfc3d6402/12859_2017_1723_Fig1_HTML.jpg
