Department of Information Technology, Uppsala University, Box 337, 751 05, Uppsala, Sweden.
Department of Pharmaceutical Biosciences, Uppsala University, Box 591, 751 24, Uppsala, Sweden.
Gigascience. 2020 May 1;9(5). doi: 10.1093/gigascience/giaa042.
Life science is increasingly driven by Big Data analytics, and the MapReduce programming model has proven successful for data-intensive analyses. However, current MapReduce frameworks offer poor support for reusing existing processing tools in bioinformatics pipelines. Furthermore, these frameworks lack native support for application containers, which are becoming popular in scientific data processing.
Here we present MaRe, an open source programming library that introduces support for Docker containers in Apache Spark. Apache Spark and Docker are the MapReduce framework and container engine that have attracted the largest open source communities; thus, MaRe provides interoperability with the cutting-edge software ecosystem. We demonstrate MaRe on 2 data-intensive applications in life science, showing ease of use and scalability.
MaRe enables scalable data-intensive processing in life science with Apache Spark and application containers. When compared with current best practices, which involve the use of workflow systems, MaRe has the advantage of providing data locality, ingestion from heterogeneous storage systems, and interactive processing. MaRe is generally applicable and available as open source software.
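The general idea of wrapping existing command-line tools in map and reduce steps can be illustrated with a minimal, single-machine sketch. This is not MaRe's API: Apache Spark partitions are replaced by plain Python lists and Docker containers by local subprocesses, and all names (`run_tool`, `map_partitions`, `reduce_outputs`, `GC_COUNT_CMD`, `SUM_CMD`) are hypothetical, chosen only for illustration.

```python
import subprocess
import sys
from functools import reduce


def run_tool(command, stdin_text):
    """Run an external command, feed it text on stdin, and return its stdout."""
    result = subprocess.run(
        command, input=stdin_text, capture_output=True, text=True, check=True
    )
    return result.stdout


# Assumption: these two commands stand in for containerized tools.
# The first counts G and C characters on stdin; the second sums integers on stdin.
GC_COUNT_CMD = [
    sys.executable, "-c",
    "import sys; d = sys.stdin.read().lower(); print(d.count('g') + d.count('c'))",
]
SUM_CMD = [
    sys.executable, "-c",
    "import sys; print(sum(int(x) for x in sys.stdin if x.strip()))",
]


def map_partitions(partitions, command):
    """Map step: run the tool once per partition, one record per input line."""
    return [run_tool(command, "\n".join(part) + "\n") for part in partitions]


def reduce_outputs(outputs, command):
    """Reduce step: repeatedly combine two tool outputs into one with the tool."""
    return reduce(lambda a, b: run_tool(command, a + b), outputs)


# Two in-memory "partitions" of DNA records (made-up example data).
partitions = [["ACGT", "GGCC"], ["TTAA", "CGCG"]]
mapped = map_partitions(partitions, GC_COUNT_CMD)
total = int(reduce_outputs(mapped, SUM_CMD))
print(total)
```

In a distributed setting, each partition would be processed on the node that already holds it, which is the data locality advantage mentioned above; this sketch only shows the shape of the computation.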