Bianchi Valerio, Ceol Arnaud, Ogier Alessandro G E, de Pretis Stefano, Galeota Eugenia, Kishore Kamal, Bora Pranami, Croci Ottavio, Campaner Stefano, Amati Bruno, Morelli Marco J, Pelizzola Mattia
Center for Genomic Science of IIT@SEMM, Fondazione Istituto Italiano di Tecnologia Milano, Italy.
Department of Experimental Oncology, European Institute of Oncology Milano, Italy.
Front Genet. 2016 May 6;7:75. doi: 10.3389/fgene.2016.00075. eCollection 2016.
Next-generation sequencing (NGS) technologies have deeply changed our understanding of cellular processes by delivering an astonishing amount of data at affordable prices; many biology laboratories have by now accumulated a large number of sequenced samples. However, managing and analyzing these data poses new challenges, which are easily underestimated by research groups lacking IT and quantitative skills. In this perspective, we identify five issues that should be carefully addressed by research groups approaching NGS technologies: (1) adopting a laboratory information management system (LIMS) and safeguarding the resulting raw data structure in downstream analyses; (2) monitoring the flow of the data and standardizing input and output directories and file names, even when multiple analysis protocols are used on the same data; (3) ensuring complete traceability of the analyses performed; (4) enabling inexperienced users to run analyses through a graphical user interface (GUI) acting as a front-end for the pipelines; (5) relying on standard metadata to annotate the datasets and, when possible, using controlled vocabularies, ideally derived from biomedical ontologies. Finally, we discuss the currently available tools in the light of these issues, and we introduce HTS-flow, a new workflow management system conceived to address the concerns we raised. HTS-flow retrieves information from a LIMS database, manages data analyses through a simple GUI, outputs data in standard locations, and allows the complete traceability of datasets, accompanying metadata, and analysis scripts.
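Issues (2) and (3) above can be illustrated with a minimal Python sketch. It is not taken from HTS-flow: the directory layout `<root>/<sample_id>/<analysis>/<run_id>`, the file name `provenance.json`, and all function names are hypothetical choices made only for illustration.

```python
import hashlib
import json
from pathlib import Path


def standard_output_dir(root, sample_id, analysis, run_id):
    """Build a predictable output location (hypothetical layout):
    <root>/<sample_id>/<analysis>/<run_id>."""
    return Path(root) / sample_id / analysis / run_id


def provenance_record(sample_id, analysis, run_id, script_text, parameters):
    """Collect what is needed to trace an output back to its inputs and code."""
    return {
        "sample_id": sample_id,
        "analysis": analysis,
        "run_id": run_id,
        # Hash of the analysis script, so the exact code version is recorded.
        "script_sha256": hashlib.sha256(script_text.encode("utf-8")).hexdigest(),
        "parameters": parameters,
    }


def write_provenance(out_dir, record):
    """Write the record next to the results as provenance.json (assumed name)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "provenance.json"
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return path
```

Because the path is derived only from the sample, analysis type, and run identifier, two protocols applied to the same sample end up in sibling directories rather than overwriting each other, and each result carries a record of the script and parameters that produced it.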