Milicchio Franco, Rose Rebecca, Bian Jiang, Min Jae, Prosperi Mattia
Department of Engineering, Roma Tre University, Rome, Italy.
Bioinfoexperts, LLC, Thibodaux, LA, USA.
BioData Min. 2016 Apr 27;9:16. doi: 10.1186/s13040-016-0095-3. eCollection 2016.
High-throughput or next-generation sequencing (NGS) technologies have become an established and affordable experimental framework in biological and medical sciences for all basic and translational research. Processing and analyzing NGS data is challenging: the data are big, heterogeneous, sparse, and error-prone. Although a plethora of tools for NGS data analysis has emerged in the past decade, (i) software development is still lagging behind data generation capabilities, and (ii) there is a 'cultural' gap between the end user and the developer.
Generic software template libraries specifically developed for NGS can help in dealing with the former problem, whilst coupling template libraries with visual programming may help with the latter. Here we scrutinize the state-of-the-art low-level software libraries implemented specifically for NGS and graphical tools for NGS analytics. An ideal development environment for NGS should be modular (with a native library interface), scalable in computational methods (i.e., serial, multithreaded, distributed), transparent (platform-independent), interoperable (with an external software interface), and usable (via an intuitive graphical user interface). These characteristics should facilitate both running standardized NGS pipelines and developing new workflows based on technological advancements or users' needs. We discuss in detail the potential of a computational framework blending generic template programming and visual programming that addresses all of the current limitations.
In the long term, a proper, well-developed (although not necessarily unique) software framework will bridge the current gap between data generation and hypothesis testing. This will eventually facilitate the development of novel diagnostic tools embedded in routine healthcare.