MaRe: Processing Big Data with application containers on Apache Spark.

Affiliations

Department of Information Technology, Uppsala University, Box 337, 75105, Uppsala, Sweden.

Department of Pharmaceutical Biosciences, Uppsala University, Box 591, 751 24, Uppsala, Sweden.

Publication information

Gigascience. 2020 May 1;9(5). doi: 10.1093/gigascience/giaa042.

PMID:32369166

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7199472/

Abstract

BACKGROUND

Life science is increasingly driven by Big Data analytics, and the MapReduce programming model has been proven successful for data-intensive analyses. However, current MapReduce frameworks offer poor support for reusing existing processing tools in bioinformatics pipelines. Furthermore, these frameworks do not have native support for application containers, which are becoming popular in scientific data processing.

RESULTS

Here we present MaRe, an open source programming library that introduces support for Docker containers in Apache Spark. Apache Spark and Docker are the MapReduce framework and container engine that have collected the largest open source community; thus, MaRe provides interoperability with the cutting-edge software ecosystem. We demonstrate MaRe on 2 data-intensive applications in life science, showing ease of use and scalability.

CONCLUSIONS

MaRe enables scalable data-intensive processing in life science with Apache Spark and application containers. When compared with current best practices, which involve the use of workflow systems, MaRe has the advantage of providing data locality, ingestion from heterogeneous storage systems, and interactive processing. MaRe is generally applicable and available as open source software.
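The idea summarized above, applying an existing command-line tool to each data partition and then combining the partial outputs with a second tool, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it does not use MaRe's API, Apache Spark, or Docker. The helper names (`run_tool`, `map_partitions`, `reduce_partitions`) and the two stand-in commands are hypothetical, and each "tool" is an ordinary subprocess where a real setup would presumably launch a containerized command.

```python
import subprocess
import sys
from functools import reduce

# Hypothetical stand-ins for containerized tools. In a real deployment each
# command would presumably run inside a container; here it is a plain
# subprocess so the sketch runs anywhere.
GC_COUNT = [sys.executable, "-c",
            "import sys; d = sys.stdin.read(); print(d.count('G') + d.count('C'))"]
SUM = [sys.executable, "-c",
       "import sys; print(sum(int(x) for x in sys.stdin.read().split()))"]


def run_tool(command, stdin_text):
    """Run an external command on one partition (via stdin) and return its stdout."""
    result = subprocess.run(command, input=stdin_text,
                            capture_output=True, text=True, check=True)
    return result.stdout


def map_partitions(partitions, command):
    """Map step: apply the tool independently to every partition."""
    return [run_tool(command, part) for part in partitions]


def reduce_partitions(outputs, command):
    """Reduce step: repeatedly merge two partial outputs with the tool."""
    return reduce(lambda a, b: run_tool(command, a + b), outputs)


# Toy input: three partitions of DNA sequence text (made-up data).
partitions = ["GATTACA\nGGCC\n", "CCGG\nATAT\n", "GCGC\n"]

counts = map_partitions(partitions, GC_COUNT)       # one G/C count per partition
total = reduce_partitions(counts, SUM).strip()      # combined count as a string

if __name__ == "__main__":
    print(total)
```

Because every partition is handled by an independent command invocation, the map step could in principle be distributed across workers, while the reduce step only requires that the combining tool accept the concatenation of two partial outputs.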
