Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems.

Authors

Djaffardjy Marine, Marchment George, Sebe Clémence, Blanchet Raphael, Bellajhame Khalid, Gaignard Alban, Lemoine Frédéric, Cohen-Boulakia Sarah

Affiliations

Université Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, Orsay 91405, France.

Nantes Université, CNRS, INSERM, l'institut du thorax, 8 quai Moncousu, Nantes F-44000, France.

Publication information

Comput Struct Biotechnol J. 2023 Mar 7;21:2075-2085. doi: 10.1016/j.csbj.2023.03.003. eCollection 2023.

DOI: 10.1016/j.csbj.2023.03.003

PMID: 36968012

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10030817/

Abstract

Data analysis pipelines are now established as an effective means for specifying and executing bioinformatics data analysis and experiments. While scripting languages, particularly Python, R and notebooks, are popular and sufficient for developing small-scale pipelines that are often intended for a single user, it is now widely recognized that they are by no means enough to support the development of large-scale, shareable, maintainable and reusable pipelines capable of handling large volumes of data and running on high performance computing clusters. This review outlines the key requirements for building large-scale data pipelines and provides a mapping of existing solutions that fulfill them. We then highlight the benefits of using scientific workflow systems to get modular, reproducible and reusable bioinformatics data analysis pipelines. We finally discuss current workflow reuse practices based on an empirical study we performed on a large collection of workflows.

Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ed4/10030817/d86a96492c2f/ga1.jpg

Similar articles

Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems.

Comput Struct Biotechnol J. 2023 Mar 7;21:2075-2085. doi: 10.1016/j.csbj.2023.03.003. eCollection 2023.

Workflows for microarray data processing in the Kepler environment.

BMC Bioinformatics. 2012 May 17;13:102. doi: 10.1186/1471-2105-13-102.

Scalable Workflows and Reproducible Data Analysis for Genomics.

Methods Mol Biol. 2019;1910:723-745. doi: 10.1007/978-1-4939-9074-0_24.

SciPipe: A workflow library for agile development of complex and dynamic bioinformatics pipelines.

Gigascience. 2019 May 1;8(5). doi: 10.1093/gigascience/giz044.

NeuroPycon: An open-source python toolbox for fast multi-modal and reproducible brain connectivity pipelines.

Neuroimage. 2020 Oct 1;219:117020. doi: 10.1016/j.neuroimage.2020.117020. Epub 2020 Jun 6.

Reproducible Large-Scale Neuroimaging Studies with the OpenMOLE Workflow Management System.

Front Neuroinform. 2017 Mar 22;11:21. doi: 10.3389/fninf.2017.00021. eCollection 2017.

Bpipe: a tool for running and managing bioinformatics pipelines.

Bioinformatics. 2012 Jun 1;28(11):1525-6. doi: 10.1093/bioinformatics/bts167. Epub 2012 Apr 12.

A demonstration of modularity, reuse, reproducibility, portability and scalability for modeling and simulation of cardiac electrophysiology using Kepler Workflows.

PLoS Comput Biol. 2019 Mar 8;15(3):e1006856. doi: 10.1371/journal.pcbi.1006856. eCollection 2019 Mar.

Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers.

Nat Methods. 2021 Oct;18(10):1161-1168. doi: 10.1038/s41592-021-01254-9. Epub 2021 Sep 23.

Using prototyping to choose a bioinformatics workflow management system.

PLoS Comput Biol. 2021 Feb 25;17(2):e1008622. doi: 10.1371/journal.pcbi.1008622. eCollection 2021 Feb.

Cited by

Progress and new challenges in image-based profiling.

ArXiv. 2025 Aug 7:arXiv:2508.05800v1.

Gain efficiency with streamlined and automated data processing: Examples from high-throughput monoclonal antibody production.

PLoS One. 2025 Jul 1;20(7):e0326678. doi: 10.1371/journal.pone.0326678. eCollection 2025.

Recommendations for Successful Development and Implementation of Digital Health Technology Tools.

J Med Internet Res. 2025 Jun 11;27:e56747. doi: 10.2196/56747.

KiNext: a portable and scalable workflow for the identification and classification of protein kinases.

BMC Bioinformatics. 2024 Oct 25;25(1):338. doi: 10.1186/s12859-024-05953-w.

BioFlow-Insight: facilitating reuse of Nextflow workflows with structure reconstruction and visualization.

NAR Genom Bioinform. 2024 Aug 6;6(3):lqae092. doi: 10.1093/nargab/lqae092. eCollection 2024 Sep.

Scalable and versatile container-based pipelines for de novo genome assembly and bacterial annotation.

F1000Res. 2023 Sep 25;12:1205. doi: 10.12688/f1000research.139488.1. eCollection 2023.

CAMP: A modular metagenomics analysis system for integrated multi-step data exploration.

bioRxiv. 2024 Sep 14:2023.04.09.536171. doi: 10.1101/2023.04.09.536171.

References

The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2022 update.

Nucleic Acids Res. 2022 Jul 5;50(W1):W345-W351. doi: 10.1093/nar/gkac247.

Ten simple rules for making a software tool workflow-ready.

PLoS Comput Biol. 2022 Mar 24;18(3):e1009823. doi: 10.1371/journal.pcbi.1009823. eCollection 2022 Mar.

Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers.

Nat Methods. 2021 Oct;18(10):1161-1168. doi: 10.1038/s41592-021-01254-9. Epub 2021 Sep 23.

Fifteen quick tips for success with HPC, i.e., responsibly BASHing that Linux cluster.

PLoS Comput Biol. 2021 Aug 5;17(8):e1009207. doi: 10.1371/journal.pcbi.1009207. eCollection 2021 Aug.

Towards FAIR protocols and workflows: the OpenPREDICT use case.

PeerJ Comput Sci. 2020 Sep 21;6:e281. doi: 10.7717/peerj-cs.281. eCollection 2020.

Streamlining data-intensive biology with workflow systems.

Gigascience. 2021 Jan 13;10(1). doi: 10.1093/gigascience/giaa140.

State of the Field in Multi-Omics Research: From Computational Needs to Data Mining and Sharing.

Front Genet. 2020 Dec 10;11:610798. doi: 10.3389/fgene.2020.610798. eCollection 2020.

Location of intracranial aneurysms is the main factor associated with rupture in the ICAN population.

J Neurol Neurosurg Psychiatry. 2021 Feb;92(2):122-128. doi: 10.1136/jnnp-2020-324371. Epub 2020 Oct 23.

Seven quick tips for analysis scripts in neuroimaging.

PLoS Comput Biol. 2020 Mar 26;16(3):e1007358. doi: 10.1371/journal.pcbi.1007358. eCollection 2020 Mar.

The nf-core framework for community-curated bioinformatics pipelines.

Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x.
