The Dockstore: enabling modular, community-focused sharing of Docker-based genomics tools and workflows.

Author information

O'Connor Brian D, Yuen Denis, Chung Vincent, Duncan Andrew G, Liu Xiang Kun, Patricia Janice, Paten Benedict, Stein Lincoln, Ferretti Vincent

Affiliations

UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA.

Ontario Institute for Cancer Research, MaRS Centre, Toronto, Canada.

Publication information

F1000Res. 2017 Jan 18;6:52. doi: 10.12688/f1000research.10137.1. eCollection 2017.

Abstract

As genomic datasets continue to grow, downloading data to a local organization and running analyses in a traditional compute environment is becoming increasingly impractical. Current large-scale projects, such as the ICGC PanCancer Analysis of Whole Genomes (PCAWG), the Data Platform for the U.S. Precision Medicine Initiative, and the NIH Big Data to Knowledge Center for Translational Genomics, are using cloud-based infrastructure both to host large datasets and to analyze them. In PCAWG, over 5,800 whole human genomes were aligned and variant-called across 14 cloud and HPC environments; the processed data was then made available on the cloud for further analysis and sharing. If run locally, an operation at this scale would have monopolized a typical academic data centre for many months and would have presented major challenges for data storage and distribution. However, this scale is increasingly typical for genomics projects and necessitates a rethink of how analytical tools are packaged and moved to the data. For PCAWG, we embraced the use of highly portable Docker images for encapsulating and sharing complex alignment and variant calling workflows across highly variable environments. While successful, this endeavor revealed a limitation in Docker containers, namely the lack of a standardized way to describe and execute the tools encapsulated inside the container. As a result, we created the Dockstore (https://dockstore.org), a project that brings together Docker images with standardized, machine-readable ways of describing and running the tools contained within. This service greatly improves the sharing and reuse of genomics tools and promotes interoperability with similar projects through emerging web service standards developed by the Global Alliance for Genomics and Health (GA4GH).
