Allain Fabrice, Roméjon Julien, La Rosa Philippe, Jarlier Frédéric, Servant Nicolas, Hupé Philippe
Mines Paris Tech, Fontainebleau, F-77305, France.
Institut Curie, Paris, F-75005, France.
Open Res Eur. 2022 Feb 21;1:76. doi: 10.12688/openreseurope.13861.2. eCollection 2021.
With the advent of high-throughput biotechnological platforms and their ever-growing capacity, life science has turned into a digitized, computational and data-intensive discipline. As a consequence, running standard analyses with a bioinformatics pipeline in routine production has become a challenge: the data must be processed in real time and delivered to end users as fast as possible. Workflow management systems, used along with packaging systems and containerization technologies, offer an opportunity to tackle this challenge. While very powerful, these tools can be used and combined in many ways, which may differ from one developer to another. Therefore, promoting homogeneous workflow implementations requires guidelines and protocols that detail how the source code of a bioinformatics pipeline should be written and organized to ensure its usability, maintainability, interoperability, sustainability, portability, reproducibility, scalability and efficiency. Capitalizing on Nextflow, Conda, Docker, Singularity and the nf-core initiative, we propose a set of best practices covering the development life cycle of a bioinformatics pipeline and its deployment for production operations. These practices target different expert communities, including i) bioinformaticians and statisticians, ii) software engineers, and iii) data managers and core facility engineers. We implemented Geniac (Automatic Configuration GENerator and Installer for nextflow pipelines), a toolbox with three components: i) technical documentation, available at https://geniac.readthedocs.io, that details coding guidelines for bioinformatics pipelines written with Nextflow, ii) a command line interface with a linter to check that the code respects the guidelines, and iii) an add-on to generate configuration files, build the containers and deploy the pipeline. The Geniac toolbox aims to harmonize development practices across developers and to automate the generation of configuration files and containers by parsing the source code of the Nextflow pipeline.
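To make the idea of generating configuration by parsing pipeline source code concrete, the following is a minimal Python sketch. It is not Geniac's implementation: the sample pipeline text, the helper names `extract_labels` and `render_container_config`, the `containers/` directory and the one-label-per-tool convention are all assumptions chosen for illustration. It scans Nextflow source for `label` directives and emits a configuration fragment that maps each label to a container image through Nextflow's `withLabel` process selector.

```python
import re

# Hypothetical pipeline source: each process declares one label naming its tool.
NEXTFLOW_SOURCE = """
process fastqc {
  label 'fastqc'
  script:
  "fastqc reads.fastq"
}

process alignment {
  label 'bwa'
  script:
  "bwa mem ref.fa reads.fastq"
}
"""

# Matches a line such as:  label 'fastqc'   or   label "bwa"
LABEL_PATTERN = re.compile(r"^\s*label\s+['\"]([\w.-]+)['\"]", re.MULTILINE)


def extract_labels(source):
    """Return the distinct process labels in order of first appearance."""
    seen = []
    for label in LABEL_PATTERN.findall(source):
        if label not in seen:
            seen.append(label)
    return seen


def render_container_config(labels, image_dir="containers"):
    """Render a Nextflow config fragment with one container per label.

    The image naming scheme (<image_dir>/<label>.sif) is an assumption.
    """
    lines = ["process {"]
    for label in labels:
        lines.append(
            f"  withLabel: {label} {{ container = '{image_dir}/{label}.sif' }}"
        )
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_container_config(extract_labels(NEXTFLOW_SOURCE)))
```

A regular expression is enough for a sketch like this, but it would miss labels built dynamically or placed in included files; a production tool would need a more robust treatment of the pipeline source.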