Masseroli Marco, Pinoli Pietro, Venco Francesco, Kaitoua Abdulrahman, Jalili Vahid, Palluzzi Fernando, Muller Heiko, Ceri Stefano
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133, Milan and Center for Genomic Science of IIT@SEMM, Istituto Italiano di Tecnologia (IIT), 20139 Milan, Italy.
Bioinformatics. 2015 Jun 15;31(12):1881-8. doi: 10.1093/bioinformatics/btv048. Epub 2015 Feb 3.
Improvements in sequencing technologies and data processing pipelines are rapidly producing sequencing data, with associated high-level features, for many individual genomes across multiple biological and clinical conditions. These data enable data-driven genomic, transcriptomic and epigenomic characterizations, but they require state-of-the-art 'big data' computing strategies, with abstraction levels beyond the capabilities of currently available tools.
We propose a high-level, declarative GenoMetric Query Language (GMQL) and a toolkit for its use. GMQL operates downstream of raw data preprocessing pipelines and supports queries over thousands of heterogeneous datasets and samples; as such, it is key to genomic 'big data' analysis. GMQL leverages a simple data model that provides abstractions of both genomic region data and the associated experimental, biological and clinical metadata, as well as interoperability between many data formats. Built on the Hadoop framework and the Apache Pig platform, GMQL ensures high scalability, expressivity, flexibility and simplicity of use, as demonstrated by several biological query examples on ENCODE and TCGA datasets.
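The data model described above, in which each sample pairs genomic regions with descriptive metadata, can be illustrated with a minimal Python sketch. This is not GMQL syntax and not the toolkit's implementation: the names `Region`, `Sample`, `select_by_metadata` and `count_overlaps`, the 0-based half-open coordinates, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Region:
    """One genomic region; coordinates assumed 0-based, half-open [start, stop)."""
    chrom: str
    start: int
    stop: int
    strand: str = "*"


@dataclass
class Sample:
    """A sample pairs its regions with attribute-value metadata (hypothetical layout)."""
    sample_id: str
    regions: List[Region] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


def select_by_metadata(dataset: List[Sample], **conditions: str) -> List[Sample]:
    """Keep the samples whose metadata match every attribute=value condition."""
    return [
        s for s in dataset
        if all(s.metadata.get(key) == value for key, value in conditions.items())
    ]


def count_overlaps(reference: List[Region], sample: Sample) -> List[int]:
    """For each reference region, count the sample regions that overlap it."""
    return [
        sum(
            1 for r in sample.regions
            if r.chrom == ref.chrom and r.start < ref.stop and ref.start < r.stop
        )
        for ref in reference
    ]


# Made-up example data, for illustration only.
dataset = [
    Sample(
        "s1",
        [Region("chr1", 100, 200), Region("chr1", 150, 300), Region("chr2", 10, 50)],
        {"assay": "ChIP-seq", "cell": "HeLa"},
    ),
    Sample("s2", [Region("chr1", 500, 600)], {"assay": "RNA-seq", "cell": "HeLa"}),
]
reference = [Region("chr1", 180, 220), Region("chr2", 100, 200)]

chipseq = select_by_metadata(dataset, assay="ChIP-seq")
overlap_counts = count_overlaps(reference, chipseq[0])
```

Filtering on metadata first and then computing over regions mirrors the separation between the two parts of the model; a declarative language would express both steps as operations over whole datasets rather than explicit loops.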
The GMQL toolkit is freely available for non-commercial use at http://www.bioinformatics.deib.polimi.it/GMQL/.