Container-based bioinformatics with Pachyderm.

Affiliations

Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, Uppsala, Sweden.

Department of Medical Sciences, Clinical Chemistry, Uppsala University, Uppsala, Sweden.

Publication information

Bioinformatics. 2019 Mar 1;35(5):839-846. doi: 10.1093/bioinformatics/bty699.

DOI:10.1093/bioinformatics/bty699
PMID:30101309
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6394392/
Abstract

MOTIVATION

Computational biologists face many challenges related to data size, and they need to manage complicated analyses often including multiple stages and multiple tools, all of which must be deployed to modern infrastructures. To address these challenges and maintain reproducibility of results, researchers need (i) a reliable way to run processing stages in any computational environment, (ii) a well-defined way to orchestrate those processing stages and (iii) a data management layer that tracks data as it moves through the processing pipeline.

RESULTS

Pachyderm is an open-source workflow system and data management framework that fulfils these needs by creating a data pipelining and data versioning layer on top of projects from the container ecosystem, having Kubernetes as the backbone for container orchestration. We adapted Pachyderm and demonstrated its attractive properties in bioinformatics. A Helm Chart was created so that researchers can use Pachyderm in multiple scenarios. The Pachyderm File System was extended to support block storage. A wrapper for initiating Pachyderm on cloud-agnostic virtual infrastructures was created. The benefits of Pachyderm are illustrated via a large metabolomics workflow, demonstrating that Pachyderm enables efficient and sustainable data science workflows while maintaining reproducibility and scalability.
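
As a rough illustration of the pipelining layer described above, the sketch below builds two chained pipeline specifications as plain Python dictionaries and serializes them to JSON. The field names (`pipeline.name`, `transform.image`, `transform.cmd`, `input.pfs.repo`, `input.pfs.glob`, `parallelism_spec.constant`) follow the Pachyderm pipeline specification as commonly documented, but the schema has changed between releases and should be checked against the deployed version. The repository names, container images, and commands are hypothetical and are not taken from the paper's metabolomics workflow.

```python
import json


def make_pipeline_spec(name, image, cmd, input_repo, glob="/*", workers=1):
    """Build a Pachyderm-style pipeline specification as a plain dict.

    Assumption: field names mirror the JSON pipeline spec; verify them
    against the Pachyderm version in use.
    """
    return {
        "pipeline": {"name": name},
        # Container image and command that process the input data.
        "transform": {"image": image, "cmd": list(cmd)},
        # Versioned input repository; the glob pattern controls how its
        # contents are split into independently processed units.
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
        # Number of parallel workers requested for this stage.
        "parallelism_spec": {"constant": workers},
    }


# Hypothetical two-stage chain. The second stage reads from a repository
# named after the first pipeline, which is how stages are linked here.
stages = [
    make_pipeline_spec(
        name="peak-picking",
        image="example/ms-tools:latest",      # hypothetical image
        cmd=["sh", "/code/pick_peaks.sh"],    # hypothetical script
        input_repo="raw-spectra",             # hypothetical repo
        workers=4,
    ),
    make_pipeline_spec(
        name="feature-finding",
        image="example/ms-tools:latest",      # hypothetical image
        cmd=["sh", "/code/find_features.sh"], # hypothetical script
        input_repo="peak-picking",
    ),
]


def to_json(spec):
    """Serialize one specification to indented JSON text."""
    return json.dumps(spec, indent=2)
```

Each JSON document produced by `to_json` could then be saved to a file and submitted with the Pachyderm client. Keeping one specification per stage makes the container image, the input repository, and the degree of parallelism explicit for every step of the workflow.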

AVAILABILITY AND IMPLEMENTATION

Pachyderm is available from https://github.com/pachyderm/pachyderm. The Pachyderm Helm Chart is available from https://github.com/kubernetes/charts/tree/master/stable/pachyderm. Pachyderm is available out-of-the-box from the PhenoMeNal VRE (https://github.com/phnmnl/KubeNow-plugin) and general Kubernetes environments instantiated via KubeNow. The code of the workflow used for the analysis is available on GitHub (https://github.com/pharmbio/LC-MS-Pachyderm).

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fce2/6394392/c0adb33da7be/bty699f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fce2/6394392/0b0976ebda77/bty699f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fce2/6394392/0dfaa4848947/bty699f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fce2/6394392/45bf8e955afd/bty699f3.jpg
