Kim Baekdoo, Ali Thahmina, Lijeron Carlos, Afgan Enis, Krampis Konstantinos
Center for Translational and Basic Research and Belfer Research Building, Hunter College of The City University of New York, 413 E 69th St, New York, NY 10021.
Johns Hopkins University, Department of Biology, B3400 N Charles St, Mudd Hall 144, Baltimore MD 21218.
Gigascience. 2017 Aug 1;6(8):1-7. doi: 10.1093/gigascience/gix048.
Processing of next-generation sequencing (NGS) data requires significant technical skills, involving installation, configuration, and execution of bioinformatics data pipelines, in addition to specialized postanalysis visualization and data mining software. To address some of these challenges, developers have leveraged virtualization containers for seamless deployment of preconfigured bioinformatics software and pipelines on any computational platform. We present an approach for abstracting the complex data operations of multistep bioinformatics pipelines for NGS data analysis. As examples, we have deployed 2 pipelines, for RNA sequencing and chromatin immunoprecipitation sequencing, preconfigured within Docker virtualization containers we call Bio-Docklets. Each Bio-Docklet exposes a single data input and output endpoint, so that from a user perspective running a pipeline is as simple as running a single bioinformatics tool. This is achieved using a "meta-script" that automatically starts the Bio-Docklets and controls pipeline execution through the BioBlend software library and the Galaxy Application Programming Interface. The pipeline output is postprocessed by integration with the Visual Omics Explorer framework, providing interactive data visualizations that users can access through a web browser. Our goal is to enable easy access to NGS data analysis pipelines for nonbioinformatics experts in any computing environment, whether a laboratory workstation, a university computer cluster, or a cloud service provider. Beyond end users, Bio-Docklets also enable developers to programmatically deploy and run a large number of pipeline instances for concurrent analysis of multiple datasets.
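As an illustration of the meta-script idea described above, the following is a minimal sketch, not the authors' actual script. It assumes a hypothetical container image name, a Galaxy server listening on port 80 inside the container, a /data mount point, and a workflow with a single input at step index "0"; the function names `docker_run_command` and `run_pipeline` are invented for this example. The BioBlend calls used (`GalaxyInstance`, `histories.create_history`, `tools.upload_file`, `workflows.get_workflows`, `workflows.invoke_workflow`) are part of the BioBlend Galaxy client.

```python
def docker_run_command(image, host_port, data_dir, container_port=80):
    """Build the `docker run` argument list that starts one container.

    Assumptions: Galaxy listens on `container_port` inside the container,
    and input/output files are exchanged through a /data mount.
    `data_dir` should be an absolute path on the host.
    """
    return [
        "docker", "run", "-d",                     # start detached
        "-p", f"{host_port}:{container_port}",     # expose Galaxy on the host
        "-v", f"{data_dir}:/data",                 # shared input/output folder
        image,
    ]


def run_pipeline(galaxy_url, api_key, workflow_name, input_file):
    """Upload one input file and invoke a named Galaxy workflow on it."""
    # Imported here so the command builder above works without BioBlend.
    from bioblend.galaxy import GalaxyInstance

    gi = GalaxyInstance(url=galaxy_url, key=api_key)

    # A fresh history keeps the datasets of this run together.
    history = gi.histories.create_history(name="bio-docklet-run")

    # Upload the single input dataset and remember its id.
    upload = gi.tools.upload_file(str(input_file), history["id"])
    dataset_id = upload["outputs"][0]["id"]

    # Look the workflow up by name; assumes exactly one match exists.
    workflow = gi.workflows.get_workflows(name=workflow_name)[0]

    # Assumption: the workflow's only input sits at step index "0".
    return gi.workflows.invoke_workflow(
        workflow["id"],
        inputs={"0": {"id": dataset_id, "src": "hda"}},
        history_id=history["id"],
    )
```

A caller would pass the list returned by `docker_run_command("example/rnaseq-docklet", 8080, "/home/user/ngs")` to `subprocess.run`, wait for Galaxy to start, and then call `run_pipeline("http://localhost:8080", api_key, workflow_name, input_file)`. Launching several containers on different host ports in a loop would correspond to the concurrent multi-dataset use mentioned above.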