Passerat-Palmbach Jonathan, Reuillon Romain, Leclaire Mathieu, Makropoulos Antonios, Robinson Emma C, Parisot Sarah, Rueckert Daniel
BioMedIA Group, Department of Computing, Imperial College London, London, UK.
Institut des Systèmes Complexes Paris Île-de-France, Paris, France.
Front Neuroinform. 2017 Mar 22;11:21. doi: 10.3389/fninf.2017.00021. eCollection 2017.
OpenMOLE is a scientific workflow engine with a strong emphasis on workload distribution. Workflows are designed using a high-level Domain Specific Language (DSL) built on top of Scala. The DSL exposes natural parallelism constructs that make it easy to delegate the workload resulting from a workflow to a wide range of distributed computing environments, and it hides much of the complexity of designing complex experiments. Users can embed their own applications and scale their pipelines from a small prototype running on their desktop computer to a large-scale study harnessing distributed computing infrastructures, simply by changing a single line in the pipeline definition. The construction of the pipeline itself is decoupled from the execution context, and the high-level DSL abstracts the underlying execution environment, in contrast to classic shell-script-based pipelines. These two aspects allow pipelines to be shared and studies to be replicated across different computing environments. Workflows can be run as traditional batch pipelines or coupled with OpenMOLE's advanced exploration methods in order to study the behavior of an application or to perform automatic parameter tuning. In this work, we briefly present the strong assets of OpenMOLE and detail recent improvements targeting re-executability of workflows across various Linux platforms. We have tightly coupled OpenMOLE with CARE, a standalone containerization solution that allows any application previously packaged on one Linux host to be re-executed on another Linux host. The solution is evaluated against a Python-based pipeline involving packages such as scikit-learn as well as binary dependencies. All were packaged and re-executed successfully on various HPC environments, with identical numerical results (here prediction scores) obtained on each environment. Our results show that the pair formed by OpenMOLE and CARE is a reliable solution to generate reproducible results and re-executable pipelines. A demonstration of the flexibility of our solution showcases three neuroimaging pipelines harnessing distributed computing environments as heterogeneous as local clusters or the European Grid Infrastructure (EGI).
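The decoupling of pipeline construction from execution context can be illustrated with a minimal sketch. It is written in plain Python using only the standard library and is not OpenMOLE's DSL or API; the names `score` and `run_pipeline` are hypothetical, and the local thread pool merely stands in for a computing environment. The point it shows is that the pipeline definition stays untouched while a single line selects where the work runs.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Iterable, List


def score(parameter: int) -> int:
    """Hypothetical stand-in for a user application embedded in the pipeline."""
    return parameter * parameter


def run_pipeline(environment: Executor, parameters: Iterable[int]) -> List[int]:
    """Apply the same task to every parameter on whichever environment is supplied.

    The pipeline only relies on the generic Executor interface, so it does not
    know (or care) how or where the tasks are actually executed.
    """
    # Executor.map returns results in the order of the inputs.
    return list(environment.map(score, parameters))


if __name__ == "__main__":
    parameters = range(5)
    # Assumption for illustration: this is the single line that would change
    # when moving to another environment (e.g. a ProcessPoolExecutor).
    with ThreadPoolExecutor(max_workers=2) as environment:
        print(run_pipeline(environment, parameters))  # prints [0, 1, 4, 9, 16]
```

Because `run_pipeline` depends only on the executor interface, the same definition yields the same results regardless of how many workers, or which kind of executor, is passed in. This is only an analogy for the property described above, under the stated assumptions.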