HBLAST: Parallelised sequence similarity--A Hadoop MapReducable basic local alignment search tool.

Authors

O'Driscoll Aisling, Belogrudov Vladislav, Carroll John, Kropp Kai, Walsh Paul, Ghazal Peter, Sleator Roy D

Affiliations

Department of Computing, Cork Institute of Technology, Rossa Avenue, Bishopstown, Cork, Ireland.

Publication information

J Biomed Inform. 2015 Apr;54:58-64. doi: 10.1016/j.jbi.2015.01.008. Epub 2015 Jan 24.

Abstract

The recent exponential growth of genomic databases has resulted in the common task of sequence alignment becoming one of the major bottlenecks in the field of computational biology. These large datasets and complex computations typically require cost-prohibitive High Performance Computing (HPC) to function. As such, parallelised solutions have been proposed, but many exhibit scalability limitations and are incapable of effectively processing "Big Data", the name attributed to datasets that are extremely large, complex and require rapid processing. The Hadoop framework, comprising distributed storage and a parallelised programming framework known as MapReduce, is specifically designed to work with such datasets, but it is not trivial to efficiently redesign and implement bioinformatics algorithms according to this paradigm. The parallelisation strategy of "divide and conquer" for alignment algorithms can be applied to both data sets and input query sequences. However, scalability is still an issue due to memory constraints or large databases, with very large database segmentation leading to additional performance decline. Herein, we present Hadoop Blast (HBlast), a parallelised BLAST algorithm that proposes a flexible method to partition both databases and input query sequences using "virtual partitioning". HBlast presents improved scalability over existing solutions and a well-balanced computational workload while keeping database segmentation and recompilation to a minimum. Enhanced BLAST search performance on cheap, memory-constrained hardware has significant implications for in-field clinical diagnostic testing, enabling faster and more accurate identification of pathogenic DNA in human blood or tissue samples.
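The abstract does not say how HBlast implements "virtual partitioning", so the following is only a minimal Python sketch of the general divide-and-conquer idea it mentions: split the query set into chunks, describe the database as offset ranges instead of physically re-segmenting it, and pair the two into independent work units that map tasks could process. All function names, parameters and the offset-range representation are assumptions made for illustration; none of them come from the paper.

```python
from itertools import product


def split_queries(queries, chunk_size):
    """Split a list of query sequences into chunks of at most chunk_size."""
    return [queries[i:i + chunk_size] for i in range(0, len(queries), chunk_size)]


def virtual_partitions(db_length, n_parts):
    """Return (start, end) offset ranges that cover a database of db_length
    records. The ranges only describe the partitions; no data is copied.
    (Hypothetical representation, not taken from HBlast.)"""
    base, extra = divmod(db_length, n_parts)
    ranges, start = [], 0
    for i in range(n_parts):
        # Spread the remainder over the first `extra` partitions.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def work_units(queries, chunk_size, db_length, n_parts):
    """Pair every query chunk with every database range.
    Each pair is one independent alignment task."""
    return list(product(split_queries(queries, chunk_size),
                        virtual_partitions(db_length, n_parts)))


if __name__ == "__main__":
    for chunk, (start, end) in work_units(["q1", "q2", "q3"], 2, 10, 3):
        print(chunk, "vs database records", start, "to", end)
```

In this sketch, changing the number of database partitions only changes the list of offset ranges, which loosely mirrors the abstract's stated goal of keeping database segmentation and recompilation to a minimum.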

