Large-scale parallel genome assembler over cloud computing environment.

Authors

Das Arghya Kusum, Koppa Praveen Kumar, Goswami Sayan, Platania Richard, Park Seung-Jong

Affiliation

1 School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, 340 East Parker Blvd, Baton Rouge, Louisiana 70803, USA.

Publication information

J Bioinform Comput Biol. 2017 Jun;15(3):1740003. doi: 10.1142/S0219720017400030. Epub 2017 May 23.

Abstract

The volume of high-throughput DNA sequencing data has already reached the terabyte scale. To manage this data, many downstream sequencing applications have started using locality-based computing on cloud infrastructures to take advantage of elastic (pay-as-you-go) resources at lower cost. However, the locality-based programming model (e.g., MapReduce) is relatively new. Consequently, both developing scalable data-intensive bioinformatics applications with this model and understanding the hardware environment these applications need for good performance require further research. In this paper, we present GiGA, a de Bruijn graph-oriented Parallel Giraph-based Genome Assembler, together with the hardware platform required for its optimal performance. GiGA uses Hadoop (MapReduce) and Giraph (large-scale graph analysis) to achieve high scalability over hundreds of compute nodes by colocating computation and data. On a traditional HPC cluster, GiGA achieves significantly higher scalability than contemporary parallel assemblers (e.g., ABySS and Contrail) with competitive assembly quality. Moreover, we show that GiGA performs significantly better on an SSD-based private cloud infrastructure than on a traditional HPC cluster: its performance on 256 cores of the SSD-based cloud closely matches that on 512 cores of the traditional HPC cluster.
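For readers unfamiliar with de Bruijn graph assembly, the core idea can be illustrated on a single machine. The sketch below is not GiGA's implementation, which the abstract describes as distributed over Hadoop and Giraph; the function names `build_de_bruijn` and `extend_contig` are hypothetical, and the example assumes error-free reads from a single strand.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers."""
    successors = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its (k-1)-mer prefix to its (k-1)-mer suffix.
            successors[kmer[:-1]].add(kmer[1:])
    return successors


def extend_contig(successors, start):
    """Walk forward from `start` along the unambiguous (unbranched) path."""
    # Invert the edges so we can detect nodes with more than one predecessor.
    predecessors = defaultdict(set)
    for node, nexts in successors.items():
        for nxt in nexts:
            predecessors[nxt].add(node)

    contig, node, seen = start, start, {start}
    while len(successors.get(node, ())) == 1:
        (nxt,) = successors[node]
        # Stop at a cycle or where another path merges in.
        if nxt in seen or len(predecessors[nxt]) != 1:
            break
        contig += nxt[-1]  # each step adds exactly one new base
        seen.add(nxt)
        node = nxt
    return contig


if __name__ == "__main__":
    graph = build_de_bruijn(["ATGGCG", "GGCGTA"], k=4)
    print(extend_contig(graph, "ATG"))  # prints ATGGCGTA
```

The two overlapping reads share the 4-mer `GGCG`, so the walk merges them into a single contig. A vertex-centric framework such as Giraph would perform this kind of path compression in parallel across graph partitions rather than in one sequential walk.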
