Dipartimento di Elettronica, Informazione e Bioingegneria, Via Ponzio 34/5, 20133, Milan, Italy.
BMC Bioinformatics. 2022 Apr 7;23(1):123. doi: 10.1186/s12859-022-04648-4.
Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to very important and still unsolved biomedical questions. Their integration and processing are crucial mostly for tertiary analysis of Next Generation Sequencing data, although suitable big data strategies still address mainly primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures.
We propose RGMQL, an R/Bioconductor package that provides a set of specialized functions to extract, combine, process and compare omics datasets and their metadata from different sources in different locations. RGMQL is built on the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services. Furthermore, it overcomes the limits of the GMQL declarative syntax by supporting a procedural approach to handling omics data within the R/Bioconductor environment. Most importantly, it provides full interoperability with other packages of the R/Bioconductor framework and extensibility over the most widely used genomic data structures and processing functions.
RGMQL combines the query expressiveness and computational efficiency of GMQL with a complete processing flow in the R environment, as a fully integrated extension of the R/Bioconductor framework. Here we provide three fully reproducible, biologically relevant example use cases that illustrate its flexibility of use and its interoperability with other R/Bioconductor packages. They show how RGMQL can easily scale up from local to parallel and cloud computing while it combines and analyzes heterogeneous omics data from local or remote datasets, both public and private, in a way that is completely transparent to the user.