
The Dockstore: enabling modular, community-focused sharing of Docker-based genomics tools and workflows.

Author information

O'Connor Brian D, Yuen Denis, Chung Vincent, Duncan Andrew G, Liu Xiang Kun, Patricia Janice, Paten Benedict, Stein Lincoln, Ferretti Vincent

Affiliations

UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA.

Ontario Institute for Cancer Research, MaRS Centre, Toronto, Canada.

Publication information

F1000Res. 2017 Jan 18;6:52. doi: 10.12688/f1000research.10137.1. eCollection 2017.

DOI:10.12688/f1000research.10137.1
PMID:28344774
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5333608/
Abstract

As genomic datasets continue to grow, the feasibility of downloading data to a local organization and running analysis on a traditional compute environment is becoming increasingly problematic. Current large-scale projects, such as the ICGC PanCancer Analysis of Whole Genomes (PCAWG), the Data Platform for the U.S. Precision Medicine Initiative, and the NIH Big Data to Knowledge Center for Translational Genomics, are using cloud-based infrastructure to both host and perform analysis across large data sets. In PCAWG, over 5,800 whole human genomes were aligned and variant called across 14 cloud and HPC environments; the processed data was then made available on the cloud for further analysis and sharing. If run locally, an operation at this scale would have monopolized a typical academic data centre for many months, and would have presented major challenges for data storage and distribution. However, this scale is increasingly typical for genomics projects and necessitates a rethink of how analytical tools are packaged and moved to the data. For PCAWG, we embraced the use of highly portable Docker images for encapsulating and sharing complex alignment and variant calling workflows across highly variable environments. While successful, this endeavor revealed a limitation in Docker containers, namely the lack of a standardized way to describe and execute the tools encapsulated inside the container. As a result, we created the Dockstore ( https://dockstore.org), a project that brings together Docker images with standardized, machine-readable ways of describing and running the tools contained within. This service greatly improves the sharing and reuse of genomics tools and promotes interoperability with similar projects through emerging web service standards developed by the Global Alliance for Genomics and Health (GA4GH).
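The idea of pairing a Docker image with a machine-readable description of the tool inside it can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the descriptor format Dockstore uses, so the `ToolDescriptor` schema, its field names, the image reference, and the command-line flags below are all hypothetical.

```python
from __future__ import annotations

import json
import shlex
from dataclasses import dataclass, field


@dataclass
class ToolDescriptor:
    """Hypothetical machine-readable description of a containerized tool.

    This schema is an assumption made for illustration; it is not the
    descriptor format used by Dockstore.
    """

    name: str
    image: str                 # Docker image that packages the tool
    base_command: list[str]    # command executed inside the container
    inputs: dict[str, str] = field(default_factory=dict)  # parameter -> CLI flag

    @classmethod
    def from_json(cls, text: str) -> "ToolDescriptor":
        # Parse the descriptor from a JSON document.
        return cls(**json.loads(text))

    def docker_command(self, values: dict[str, str]) -> list[str]:
        # Refuse to build a command if a declared input has no value.
        missing = sorted(set(self.inputs) - set(values))
        if missing:
            raise ValueError(f"missing inputs: {missing}")
        cmd = ["docker", "run", "--rm", self.image, *self.base_command]
        for param, flag in self.inputs.items():
            cmd += [flag, values[param]]
        return cmd


# Hypothetical descriptor for a made-up alignment tool.
descriptor = ToolDescriptor.from_json(
    """
    {
      "name": "example-aligner",
      "image": "example.org/genomics/aligner:1.0",
      "base_command": ["align"],
      "inputs": {"reads": "--reads", "reference": "--reference"}
    }
    """
)

command = descriptor.docker_command(
    {"reads": "/data/sample.fastq", "reference": "/data/ref.fa"}
)
print(shlex.join(command))
```

Because the descriptor is plain data, any environment that can read it can construct the same invocation, which is the portability property the abstract attributes to combining images with standardized descriptions.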
