MaRe: Processing Big Data with application containers on Apache Spark.

Author information

Department of Information Technology, Uppsala University, Box 337, 75105, Uppsala, Sweden.

Department of Pharmaceutical Biosciences, Uppsala University, Box 591, 751 24, Uppsala, Sweden.

Publication information

Gigascience. 2020 May 1;9(5). doi: 10.1093/gigascience/giaa042.

Abstract

BACKGROUND

Life science is increasingly driven by Big Data analytics, and the MapReduce programming model has been proven successful for data-intensive analyses. However, current MapReduce frameworks offer poor support for reusing existing processing tools in bioinformatics pipelines. Furthermore, these frameworks do not have native support for application containers, which are becoming popular in scientific data processing.

RESULTS

Here we present MaRe, an open source programming library that introduces support for Docker containers in Apache Spark. Apache Spark and Docker are the MapReduce framework and container engine that have collected the largest open source community; thus, MaRe provides interoperability with the cutting-edge software ecosystem. We demonstrate MaRe on 2 data-intensive applications in life science, showing ease of use and scalability.

CONCLUSIONS

MaRe enables scalable data-intensive processing in life science with Apache Spark and application containers. When compared with current best practices, which involve the use of workflow systems, MaRe has the advantage of providing data locality, ingestion from heterogeneous storage systems, and interactive processing. MaRe is generally applicable and available as open source software.
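The core idea in the abstract, running existing command-line tools from container images as the map and reduce steps over partitioned data, can be illustrated with a minimal sketch. This is not MaRe's API and it does not use Apache Spark: every function name here (`docker_runner`, `split_into_partitions`, `container_map`, `container_reduce`) is hypothetical, partitions are plain Python lists, and the reduce step is a simple sequential fold chosen for brevity. The only external interface assumed is the standard `docker run --rm -i IMAGE sh -c COMMAND` invocation.

```python
import subprocess
from functools import reduce
from typing import Callable, List, Sequence

# A runner takes (image, command, stdin_text) and returns the tool's stdout.
Runner = Callable[[str, str, str], str]


def docker_runner(image: str, command: str, stdin_text: str) -> str:
    """Run `command` in a throwaway container, feeding the data on stdin.

    Assumption: Docker is installed and `image` provides `sh`.
    """
    result = subprocess.run(
        ["docker", "run", "--rm", "-i", image, "sh", "-c", command],
        input=stdin_text,
        capture_output=True,
        text=True,
        check=True,  # raise if the containerized tool exits non-zero
    )
    return result.stdout


def split_into_partitions(records: Sequence[str], n: int) -> List[List[str]]:
    """Distribute records round-robin over n partitions."""
    return [list(records[i::n]) for i in range(n)]


def container_map(
    partitions: Sequence[Sequence[str]],
    image: str,
    command: str,
    runner: Runner = docker_runner,
) -> List[str]:
    """Apply the containerized command to each partition independently."""
    return [runner(image, command, "\n".join(p) + "\n") for p in partitions]


def container_reduce(
    outputs: Sequence[str],
    image: str,
    command: str,
    runner: Runner = docker_runner,
) -> str:
    """Fold the partial outputs pairwise with a containerized command.

    The command must be associative; `outputs` must be non-empty.
    """
    return reduce(lambda a, b: runner(image, command, a + b), outputs)
```

As a usage illustration under the same assumptions, counting G and C bases in a list of DNA reads could pass `grep -o '[GC]' | wc -l` as the map command and `awk '{s+=$1} END {print s}'` as the reduce command, with a small image such as `busybox` that ships these tools. Passing the runner as a parameter keeps the orchestration logic testable without a Docker daemon.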

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ee8f/7199472/de99e93efe11/giaa042fig1.jpg
