Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy.
The German Research Center for Artificial Intelligence (DFKI), Berlin, Germany.
Bioinformatics. 2019 Mar 1;35(5):729-736. doi: 10.1093/bioinformatics/bty688.
We previously proposed a paradigm shift in genomic data management, based on the Genomic Data Model (GDM) for mediating existing data formats and on the GenoMetric Query Language (GMQL) for supporting, at a high level of abstraction, data extraction and the most common data-driven computations required by tertiary data analysis of Next Generation Sequencing datasets. Here, we present a new GMQL-based system with enhanced accessibility, portability, scalability and performance.
The new system has a modular architecture featuring: (i) an intermediate representation that supports many different implementations (including Spark, Flink and SciDB); (ii) a high-level, technology-independent repository abstraction that supports different repository technologies (e.g., local file system, Hadoop File System, database or others); (iii) several system interfaces, including a user-friendly Web-based interface, a Web Service interface, and a programmatic interface for the Python language. Biological use case examples, using public ENCODE, Roadmap Epigenomics and TCGA datasets, demonstrate the relevance of our work.
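To make the separation in points (i) and (ii) concrete, the following is a minimal sketch of how a query plan (the intermediate representation) can be kept independent of both the execution engine and the storage technology. Every name here (`Repository`, `InMemoryRepository`, `Read`, `SelectChrom`, `LocalExecutor`) is hypothetical and chosen for illustration only; none of it is the actual GMQL API, and the in-memory backend merely stands in for a real repository technology.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

# A genomic region reduced to (chromosome, start, stop); an assumption for this sketch.
Region = Tuple[str, int, int]


class Repository(ABC):
    """Hypothetical storage abstraction: callers never see the backing technology."""

    @abstractmethod
    def load(self, name: str) -> List[Region]: ...

    @abstractmethod
    def store(self, name: str, regions: List[Region]) -> None: ...


class InMemoryRepository(Repository):
    """Stand-in backend; a file-system or database backend would expose the same methods."""

    def __init__(self) -> None:
        self._datasets: Dict[str, List[Region]] = {}

    def load(self, name: str) -> List[Region]:
        return list(self._datasets[name])

    def store(self, name: str, regions: List[Region]) -> None:
        self._datasets[name] = list(regions)


# Intermediate representation: a small tree of operator nodes, with no engine-specific code.
@dataclass(frozen=True)
class Read:
    dataset: str


@dataclass(frozen=True)
class SelectChrom:
    source: "Plan"
    chrom: str


Plan = Union[Read, SelectChrom]


class LocalExecutor:
    """One possible implementation of the plan; another engine could interpret the same nodes."""

    def run(self, plan: Plan, repo: Repository) -> List[Region]:
        if isinstance(plan, Read):
            return repo.load(plan.dataset)
        if isinstance(plan, SelectChrom):
            # Evaluate the child node first, then keep regions on the requested chromosome.
            return [r for r in self.run(plan.source, repo) if r[0] == plan.chrom]
        raise TypeError(f"unknown plan node: {plan!r}")


repo = InMemoryRepository()
repo.store("peaks", [("chr1", 100, 200), ("chr2", 50, 80)])

plan = SelectChrom(Read("peaks"), "chr1")
result = LocalExecutor().run(plan, repo)
print(result)
```

In this sketch the `plan` object is built once and could be handed to a different executor or evaluated against a different `Repository` subclass without modification, which is the property the modular architecture above is meant to provide.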
The GMQL system is freely available for non-commercial use as an open source project at: http://www.bioinformatics.deib.polimi.it/GMQLsystem/.
Supplementary data are available at Bioinformatics online.