de Castro Marcelo Rodrigo, Tostes Catherine Dos Santos, Dávila Alberto M R, Senger Hermes, da Silva Fabricio A B
Computer Science Department, Federal University of São Carlos, Rod. Washington Luís, Km 235, São Carlos, 21040-900, Brazil.
LBCS-IOC, Oswaldo Cruz Foundation, Av Brasil 4365, Rio de Janeiro, 21040-900, Brazil.
BMC Bioinformatics. 2017 Jun 27;18(1):318. doi: 10.1186/s12859-017-1723-8.
The demand for processing ever-increasing amounts of genomic data has raised new challenges for the implementation of highly scalable and efficient computational systems. In this paper, we propose SparkBLAST, a parallelization of a sequence alignment application (BLAST) that employs cloud computing for the provisioning of computational resources and Apache Spark as the coordination framework. As a proof of concept, some radionuclide-resistant bacterial genomes were selected for similarity analysis.
Experiments in the Google and Microsoft Azure clouds demonstrated that SparkBLAST outperforms an equivalent system implemented on Hadoop in terms of speedup and execution times.
The superior performance of SparkBLAST is mainly due to the in-memory operations available through the Spark framework, which reduce the number of local I/O operations required for distributed BLAST processing.
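
The abstract does not say how query sequences are distributed among workers, so the following is only a minimal sketch of one plausible data-partitioning step, assuming the queries arrive as FASTA text and are assigned round-robin. The function names (parse_fasta, partition_queries, to_fasta) are hypothetical and are not taken from SparkBLAST. The Spark and BLAST invocations themselves are deliberately left out.

```python
from typing import Iterable, Iterator, List, Tuple

# A query record is a (header, sequence) pair; this layout is an assumption.
Record = Tuple[str, str]


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def partition_queries(records: Iterable[Record],
                      num_partitions: int) -> List[List[Record]]:
    """Assign records to partitions round-robin (hypothetical strategy)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    parts: List[List[Record]] = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[i % num_partitions].append(rec)
    return parts


def to_fasta(records: Iterable[Record]) -> str:
    """Serialize records back to FASTA text, one sequence line per record."""
    return "".join(f">{h}\n{s}\n" for h, s in records)
```

In a Spark-based deployment, each partition would presumably be serialized with something like to_fasta and streamed to an independent BLAST process on a worker node, with the per-partition outputs combined afterwards. Keeping the partitions in memory rather than writing them to intermediate files is the kind of saving in local I/O that the abstract credits for the performance difference.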